One of the main goals of natural language processing (NLP) is to build automated
systems that can understand and generate human languages. This goal has
so far remained elusive. Existing hand-crafted systems can provide in-depth analysis
of domain sub-languages, but are often notoriously fragile and costly to build.
Existing machine-learned systems are considerably more robust, but are limited to
relatively shallow NLP tasks.

In  this  thesis,  we  present  novel  statistical  methods  for  robust  natural  lan¬ 
guage  understanding  and  generation.  We  focus  on  two  important  sub-tasks,  seman¬ 
tic  parsing  and  tactical  generation.  The  key  idea  is  that  both  tasks  can  be  treated  as 
the  translation  between  natural  languages  and  formal  meaning  representation  lan¬ 
guages,  and  therefore,  can  be  performed  using  state-of-the-art  statistical  machine 
translation  techniques.  Specifically,  we  use  a  technique  called  synchronous  pars¬ 
ing,  which  has  been  extensively  used  in  syntax-based  machine  translation,  as  the 
unifying framework for semantic parsing and tactical generation. The parsing and
generation algorithms learn all of their linguistic knowledge from annotated corpora,
and can handle natural-language sentences that are conceptually complex.

A  nice  feature  of  our  algorithms  is  that  the  semantic  parsers  and  tactical  gen¬ 
erators  share  the  same  learned  synchronous  grammars.  Moreover,  charts  are  used  as 
the  unifying  language-processing  architecture  for  efficient  parsing  and  generation. 
Therefore,  the  generators  are  said  to  be  the  inverse  of  the  parsers,  an  elegant  prop¬ 
erty  that  has  been  widely  advocated.  Furthermore,  we  show  that  our  parsers  and 
generators  can  handle  formal  meaning  representation  languages  containing  logical 
variables,  including  predicate  logic. 

Our  basic  semantic  parsing  algorithm  is  called  Wasp.  Most  of  the  other 
parsing  and  generation  algorithms  presented  in  this  thesis  are  extensions  of  Wasp 
or its inverse. We demonstrate the effectiveness of our parsing and generation algorithms
by performing experiments in two real-world, restricted domains. Experimental
results show that our algorithms are more robust and accurate than the
best current systems that require similar supervision. Our work is also the first
attempt to use the same automatically-learned grammar for both parsing and generation.
Unlike previous systems that require manually-constructed grammars and
lexicons, our systems require much less knowledge engineering and can be easily
ported to other languages and domains.

Chapter 1


Introduction 


An  indicator  of  machine  intelligence  is  the  ability  to  converse  in  human 
languages  (Turing,  1950).  One  of  the  main  goals  of  natural  language  processing 
(NLP)  as  a  sub-field  of  artificial  intelligence  is  to  build  automated  systems  that  can 
understand  and  generate  human  languages.  This  goal  has  so  far  remained  elusive. 
Manually-constructed  knowledge-based  systems  can  understand  and  generate  do¬ 
main  sub-languages,  but  are  notoriously  fragile  and  costly  to  build.  Statistical  meth¬ 
ods  are  considerably  more  robust,  but  are  limited  to  relatively  shallow  NLP  tasks 
such  as  part-of-speech  tagging,  syntactic  parsing,  and  word  sense  disambiguation. 
Robust,  broad-coverage  NLP  systems  that  are  capable  of  understanding  and  gener¬ 
ating  human  languages  are  still  beyond  reach. 

Recent  advances  in  information  retrieval  seem  to  suggest  that  automated 
systems  can  appear  to  be  intelligent  without  any  deep  understanding  of  human  lan¬ 
guages.  However,  the  success  of  Internet  search  engines  critically  depends  on  the 
redundancy  of  natural  language  expressions  in  Web  documents.  For  example,  given 
the  following  search  query: 

Why do radio stations' names start with W?

Google  returns  a  link  to  the  following  Web  document  that  contains  the  relevant 
information:1 

1 The search was performed in July 2007. URL of Google: http://www.google.com/

Answer “Why do us eastern radio station names start with W except
KDKA KYW and KQV and western station names start with K
except WIBW and WHO?”...

Note  that  this  document  contains  an  expression  that  is  almost  identical  to  the  search 
query.  In  contrast,  when  given  rare  queries  such  as: 

Does  Germany  border  China? 

search  engines  such  as  Google  would  have  difficulty  finding  Web  documents  that 
contain  the  search  query.  This  leads  to  poor  search  results: 

The  Break-up  of  Communism  in  East  Germany  and  Eastern  Europe.  ... 

Kuo  does  not,  however,  provide  a  comprehensive  treatment  of  China’s... 

To  answer  this  query  would  require  spatial  reasoning,  which  is  impossible  unless 
the  query  is  correctly  understood. 

Similar  arguments  can  be  made  for  other  NLP  tasks  such  as  machine  trans¬ 
lation,  which  is  the  translation  between  natural  languages.  Current  statistical  ma¬ 
chine  translation  systems  typically  depend  on  the  redundancy  of  translation  pairs 
in the training corpora. When given rare sentences such as Does Germany border
China?, machine translation systems would have difficulty composing good translations
for them. Such reliance on redundancy may be reduced by using meaning
representations that are more compact than natural languages. This would require
machine translators to understand the source language as well as
generate the target language.

In  this  thesis,  we  will  present  novel  statistical  methods  for  robust  natural 
language  understanding  and  generation.  We  will  focus  on  two  important  sub-tasks, 
semantic  parsing  and  tactical  generation. 




1.1  Semantic  Parsing 

Semantic  parsing  is  the  task  of  transforming  natural-language  sentences  into 
complete,  formal,  symbolic  meaning  representations  (MR)  suitable  for  automated 
reasoning  or  further  processing.  It  is  an  integral  part  of  natural  language  inter¬ 
faces  to  databases  (Androutsopoulos  et  al.,  1995).  For  example,  in  the  Geoquery 
database  (Zelle  and  Mooney,  1996),  a  semantic  parser  is  used  to  transform  natu¬ 
ral  language  queries  into  formal  queries.  Below  is  a  sample  English  query,  and  its 
corresponding  Prolog  logical  form: 

What  is  the  smallest  state  by  area? 

answer(x1,smallest(x2,(state(x1),area(x1,x2))))

This  Prolog  logical  form  would  be  used  to  retrieve  an  answer  to  the  English  query 
from  the  Geoquery  database.  Other  potential  uses  of  semantic  parsing  include 
machine  translation  (Nyberg  and  Mitamura,  1992),  document  summarization  (Mani, 
2001),  question  answering  (Friedland  et  al.,  2004),  command  and  control  (Simmons 
et  al.,  2003),  and  interfaces  to  advice-taking  agents  (Kuhlmann  et  al.,  2004). 

1.2  Natural  Language  Generation 

Natural  language  generation  is  the  task  of  constructing  natural-language 
sentences  from  computer-internal  representations  of  information.  It  can  be  divided 
into two sub-tasks: (1) strategic generation, which decides what meanings to express,
and (2) tactical generation, which generates natural-language expressions for
those  meanings.  This  thesis  is  focused  on  the  latter  task  of  tactical  generation.  One 
of  the  earliest  motivating  applications  for  natural  language  generation  is  machine 
translation (Yngve, 1962; Wilks, 1973). It is also an important component of dialog
systems (Oh and Rudnicky, 2000) and automatic summarizers (Mani, 2001). For
example,  in  the  CMU  Communicator  travel  planning  system  (Oh  and  Rudnicky, 
2000),  the  input  to  the  tactical  generation  component  is  a  frame  of  attribute-value 
pairs: 

act  QUERY 

content  DEPART-TIME 

depart-city  New  York 

The  output  of  the  tactical  generator  would  be  a  natural  language  sentence  that  ex¬ 
presses  the  meaning  represented  by  the  input  frame: 

What  time  would  you  like  to  leave  New  York? 

1.3  Thesis  Contributions 

Much  of  the  early  research  on  semantic  parsing  and  tactical  generation  was 
focused  on  hand-crafted  knowledge-based  systems  that  require  tedious  amounts  of 
domain-specific knowledge engineering. As a result, these systems are often too
brittle for general use, and cannot be easily ported to other application domains. In
response, various machine learning approaches to semantic parsing and tactical
generation have been proposed since the mid-1990s. Regarding these machine
learning approaches, a few observations can be made:

1. Many of the statistical learning algorithms for semantic parsing are designed
for simple domains in which sentences can be represented by a single semantic
frame (e.g. Miller et al., 1996).

2. Other learning algorithms for semantic parsing that can handle complex sentences
are based on inductive logic programming or deterministic parsing,
which lack the robustness that characterizes statistical learning (e.g. Zelle
and Mooney, 1996).

3.  While  tactical  generators  enhanced  with  machine-learned  components  are 
generally  more  robust  than  their  non-machine-learned  counterparts,  most,  if 
not  all,  are  still  dependent  on  manually-constructed  grammars  and  lexicons 
that  are  very  difficult  to  maintain  (e.g.  Carroll  and  Oepen,  2005). 

In  this  thesis,  we  present  a  number  of  novel  statistical  learning  algorithms  for  se¬ 
mantic  parsing  and  tactical  generation.  These  algorithms  automatically  learn  all  of 
their  linguistic  knowledge  from  annotated  corpora,  and  can  handle  natural-language 
sentences that are conceptually complex. The resulting parsers and generators are
more robust and accurate than the best current methods requiring similar supervision,
based on experiments in four natural languages and in two real-world, restricted
domains.

The  key  idea  of  this  thesis  is  that  both  semantic  parsing  and  tactical  genera¬ 
tion  are  treated  as  language  translation  tasks.  In  other  words: 

1.  Semantic  parsing  can  be  defined  as  the  translation  from  a  natural  language 
(NL)  into  a  formal  meaning  representation  language  (MRL). 

2.  Tactical  generation  can  be  defined  as  the  translation  from  a  formal  MRL  into 
an  NL. 

Both  tasks  are  performed  using  state-of-the-art  statistical  machine  translation  tech¬ 
niques.  Specifically,  we  use  a  technique  called  synchronous  parsing.  Originally 
introduced  by  Aho  and  Ullman  (1972)  to  model  the  translation  between  formal 
languages, synchronous parsing has recently been used to model the translation between
NLs (Yamada and Knight, 2001; Chiang, 2005). We show that synchronous
parsing can be used to model the translation between NLs and MRLs as well. Moreover,
the resulting semantic parsers and tactical generators share the same learned
synchronous grammars, and charts are used as the unifying language-processing
architecture for efficient parsing and generation. Therefore, the generators are said
to be the inverse of the parsers, an elegant property that has been noted by a number
of researchers (e.g. Shieber, 1988).
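To make the idea of a shared synchronous grammar concrete, the following is a minimal sketch in Python. It is not the grammar learned by any algorithm in this thesis: the two rules, the rule names and the `expand` helper are hypothetical, and the MRL side uses a variable-free notation chosen only for illustration. Each rule pairs an NL pattern with an MRL pattern, and a single derivation yields both strings.

```python
# Hypothetical synchronous rules: (left-hand side, NL pattern, MRL pattern).
# Upper-case tokens are non-terminals that appear on both sides and are
# rewritten in lockstep.
RULES = {
    "q_traverse": ("QUERY",
                   ["what", "states", "does", "RIVER", "run", "through"],
                   ["answer(state(traverse_1(", "RIVER", ")))"]),
    "r_ohio": ("RIVER", ["the", "ohio"], ["riverid(ohio)"]),
}

def expand(derivation, side):
    """Expand one side of a derivation: side 0 is the NL, side 1 is the MRL."""
    rule_name, children = derivation
    patterns = RULES[rule_name][1:]  # (NL pattern, MRL pattern)
    out = []
    for token in patterns[side]:
        if token in children:
            # Linked non-terminal: recurse into the same sub-derivation.
            out.extend(expand(children[token], side))
        else:
            out.append(token)
    return out

# One derivation tree, read off in two ways.
derivation = ("q_traverse", {"RIVER": ("r_ohio", {})})
sentence = " ".join(expand(derivation, 0))
meaning = "".join(expand(derivation, 1))
print(sentence)
print(meaning)
```

In this toy setting, semantic parsing amounts to finding a derivation whose NL side matches the input sentence and reading off the MRL side, while tactical generation does the reverse with the same rules.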

In  addition,  we  show  that  the  synchronous  parsing  framework  can  handle 
a variety of formal MRLs. We present two sets of semantic parsing and tactical
generation algorithms for different types of MRLs: one for MRLs that are variable-free,
and one for MRLs that contain logical variables, such as predicate logic. Both sets
of algorithms are shown to be effective in their respective application domains.

1.4  Thesis  Outline 

Below  is  a  summary  of  the  remaining  chapters  of  this  thesis: 

•  In  Chapter  2,  we  provide  a  brief  overview  of  semantic  parsing,  natural  lan¬ 
guage  generation,  statistical  machine  translation,  and  synchronous  parsing. 
We  also  describe  the  application  domains  that  will  be  considered  in  subse¬ 
quent  chapters. 

•  In  Chapter  3,  we  describe  how  semantic  parsing  can  be  done  using  statistical 
machine  translation.  We  present  a  semantic  parsing  algorithm  called  Wasp, 
short  for  Word  Alignment-based  Semantic  Parsing.  This  chapter  is  focused 
on  variable-free  MRLs. 

• In Chapter 4, we extend the Wasp semantic parsing algorithm to handle target
MRLs with logical variables. The resulting algorithm is called λ-Wasp.

• In Chapter 5, we describe how tactical generation can be done using statistical
machine translation. We present results on using a recent phrase-based statistical
machine translation system, Pharaoh (Koehn et al., 2003), for tactical
generation. We also present Wasp⁻¹, which is the inverse of the Wasp semantic
parser, and two hybrid systems, Pharaoh++ and Wasp⁻¹++. Among
the four systems, Wasp⁻¹++ is shown to provide the best overall performance.
This chapter is focused on variable-free MRLs.

• In Chapter 6, we extend the Wasp⁻¹++ tactical generation algorithm to handle
source MRLs with logical variables. The resulting algorithm is called
λ-Wasp⁻¹++.

• In Chapter 7, we show some preliminary results for interlingual machine
translation, an approach to machine translation that integrates natural language
understanding and generation. We also discuss the prospect of natural
language understanding and generation for unrestricted texts, and suggest
several possible future research directions toward this goal.

•  In  Chapter  8,  we  conclude  this  thesis. 

Figure  1.1  summarizes  the  various  algorithms  presented  in  this  thesis. 

Some  of  the  work  presented  in  this  thesis  has  been  previously  published. 

Material presented in Chapters 3, 4 and 5 appeared in Wong and Mooney (2006),
Wong and Mooney (2007b) and Wong and Mooney (2007a), respectively.

Figure 1.1: The parsing and generation algorithms presented in this thesis

Chapter 2


Background 


This  thesis  encompasses  several  areas  of  NLP:  semantic  parsing  (or  natu¬ 
ral  language  understanding),  natural  language  generation,  and  machine  translation. 
These  areas  have  traditionally  formed  separate  research  communities,  to  some  de¬ 
gree  isolated  from  each  other.  In  this  chapter,  we  provide  a  brief  overview  of  these 
three  areas  of  research.  We  also  provide  background  on  synchronous  parsing  and 
synchronous  grammars,  which  we  claim  can  form  a  unifying  framework  for  these 
NLP  tasks. 

2.1  Application  Domains 

First  of  all,  we  review  the  application  domains  that  will  be  considered  in 
subsequent  sections.  Our  main  focus  is  on  application  domains  that  have  been  used 
for  evaluating  semantic  parsers.  These  domains  will  be  re-used  for  evaluating  tac¬ 
tical  generators  (Section  5.2)  and  interlingual  machine  translation  systems  (Section 
7.1). 

Much  work  on  learning  for  semantic  parsing  has  been  done  in  the  context  of 
spoken  language  understanding  (SLU)  (Wang  et  al.,  2005).  Among  the  application 
domains  developed  for  benchmarking  SLU  systems,  the  Atis  (Air  Travel  Informa¬ 
tion  Services)  domain  is  probably  the  most  well-known  (Price,  1990).  The  Atis 
corpus consists of spoken queries that were elicited by presenting human subjects
with various hypothetical travel planning scenarios to solve. The resulting spontaneous
spoken queries were recorded as the subjects interacted with automated
dialog  systems  to  solve  the  scenarios.  The  recorded  speech  was  transcribed  and 
annotated  with  SQL  queries  and  reference  answers.  Below  is  a  sample  transcribed 
query  with  its  SQL  annotation: 

Show  me  flights  from  Boston  to  New  York. 

SELECT flight_id FROM flight WHERE
from_airport = 'boston'
AND to_airport = 'new york'

The  Atis  corpus  exhibits  a  wide  range  of  interesting  phenomena  often  associated 
with  spontaneous  speech,  such  as  verbal  deletion  and  flexible  word  order.  However, 
we  will  not  focus  on  this  domain  in  this  thesis,  because  the  SQL  annotations  tend  to 
be  quite  messy,  and  it  takes  a  lot  of  human  effort  to  transform  the  SQL  annotations 
into a usable form.1 Also, most Atis queries are in fact conceptually very simple,
and  semantic  parsing  often  amounts  to  slot  filling  of  a  single  semantic  frame  (Kuhn 
and  De  Mori,  1995;  Popescu  et  al.,  2004).  We  mention  this  domain  because  much 
of  the  existing  work  described  in  Section  2.2  was  developed  for  the  Atis  domain. 

1 None of the existing Atis systems that we are aware of use SQL directly. Instead, they use intermediate
languages such as predicate logic (Zettlemoyer and Collins, 2007) which are then translated
into SQL using external tools.

In this thesis, we focus on the following two domains. The first one is
Geoquery. The aim of this domain is to develop an NL interface to a U.S. geography
database written in Prolog. This database was part of the Turbo Prolog 2.0
distribution (Borland International, 1988). The query language is basically first-order
Prolog logical forms, augmented with several meta-predicates for dealing
with quantification (Zelle and Mooney, 1996). The Geoquery corpus consists
of written English, Spanish, Japanese and Turkish queries gathered from various
sources. All queries were annotated with Prolog logical forms. Below is a sample
English query and its Prolog annotation:

What  states  does  the  Ohio  run  through? 

answer(x1,(state(x1),traverse(x2,x1),
equal(x2,riverid(ohio))))

Note that the logical variables x1 and x2 are used to denote entities. In this logical
form, state is a predicate that returns true if its argument (x1) denotes a
U.S. state, and traverse is a predicate that returns true if its first argument
(x2), which is a river, traverses its second argument (x1), which is usually a state.
The equal predicate returns true if its first argument (x2) denotes the Ohio river
(riverid(ohio)). Finally, the logical variable x1 denotes the answer (answer)
to the query. In this domain, queries typically show a deeply nested structure, which
makes the semantic parsing task rather challenging, e.g.:

What  states  border  the  states  that  the  Ohio  runs  through? 

What  states  border  the  state  that  borders  the  most  states? 

For  semantic  parsers  that  cannot  deal  with  logical  variables  (e.g.  Ge  and  Mooney, 
2006;  Kate  and  Mooney,  2006),  a  functional,  variable-free  query  language  (FunQL) 
has  been  developed  for  this  domain  (Kate  et  al.,  2005).  In  FunQL,  each  predicate 
can be seen to have a set-theoretic interpretation. For example, in the FunQL
equivalent of the Prolog logical form shown above:

answer(state(traverse_1(riverid(ohio))))

the term riverid(ohio) denotes a singleton set that consists of the Ohio river,
traverse_1 denotes the set of entities that some of the members of its argument
(which are rivers) run through2, and state denotes the subset of its argument
whose members are also U.S. states.
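The set-theoretic reading of FunQL can be sketched with ordinary Python sets. This is only an illustration under stated assumptions: the tiny fact tables below are hypothetical toy data rather than the Geoquery database, and the function bodies are one plausible reading of the predicates described above, not the specification in Appendix A.

```python
# Hypothetical toy facts; entities are (type, name) pairs.
STATES = {("state", "ohio"), ("state", "kentucky"), ("state", "texas")}
TRAVERSES = {  # (river, place that the river runs through)
    (("river", "ohio"), ("state", "ohio")),
    (("river", "ohio"), ("state", "kentucky")),
    (("river", "ohio"), ("city", "cincinnati")),
    (("river", "rio grande"), ("state", "texas")),
}

def riverid(name):
    """Singleton set containing the named river."""
    return {("river", name)}

def traverse_1(rivers):
    """Entities that some member of `rivers` runs through."""
    return {place for river, place in TRAVERSES if river in rivers}

def state(entities):
    """Subset of `entities` whose members are U.S. states."""
    return {e for e in entities if e in STATES}

def answer(entities):
    """Return the names of the answer set in sorted order."""
    return sorted(name for _, name in entities)

# FunQL form: answer(state(traverse_1(riverid(ohio))))
result = answer(state(traverse_1(riverid("ohio"))))
print(result)  # ['kentucky', 'ohio'] for the toy facts above
```

Because every predicate maps sets to sets, the nested FunQL expression is evaluated simply by composing functions from the inside out, with no logical variables involved.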

The second domain that we consider is RoboCup. RoboCup
(http://www.robocup.org/) is an international AI research initiative that uses robotic
soccer  as  its  primary  domain.  In  the  RoboCup  Coach  Competition,  teams  of  au¬ 
tonomous  agents  compete  on  a  simulated  soccer  field,  receiving  advice  from  a  team 
coach  using  a  formal  language  called  CLANG  (Chen  et  al.,  2003).  Our  specific  aim 
is  to  develop  an  NL  interface  for  autonomous  agents  to  understand  NL  advice.  The 
RoboCup  corpus  consists  of  formal  CLANG  advice  mined  from  previous  Coach 
Competition  game  logs,  annotated  with  English  translations.  Below  is  a  piece  of 
CLANG  advice  and  its  English  gloss: 

((bowner our {4})
 (do our {6} (pos (left (half our)))))

If  our  player  4  has  the  ball,  then  our  player  6  should  stay  in  the  left 
side  of  our  half. 

In CLANG, tactics are generally expressed in the form of if-then rules. Here the expression
(bowner ...) represents the “ball owner” condition, and (do ...)
is a directive that is followed when the condition holds, i.e. player 6 should position
itself (pos) in the left side (left) of our half ((half our)).
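The nested structure of this rule can be made explicit with a small bracket reader. The sketch below is a generic reader for parenthesized expressions with brace-delimited player sets; it is an assumption made for illustration and does not implement the CLANG grammar given in Appendix A. The helper names `tokenize` and `parse` are hypothetical.

```python
def tokenize(text):
    """Split a bracketed expression into brackets and atoms."""
    for ch in "(){}":
        text = text.replace(ch, f" {ch} ")
    return text.split()

def parse(tokens):
    """Parse one expression from the front of `tokens` (consumed in place)."""
    tok = tokens.pop(0)
    if tok in ("(", "{"):
        close = ")" if tok == "(" else "}"
        items = []
        while tokens[0] != close:
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing bracket
        # Parentheses become lists; braces become sets of player numbers.
        return items if tok == "(" else frozenset(items)
    return int(tok) if tok.isdigit() else tok

advice = "((bowner our {4}) (do our {6} (pos (left (half our)))))"
condition, directive = parse(tokenize(advice))
print(condition)   # the "ball owner" condition
print(directive)   # the directive followed when the condition holds
```

Reading the rule this way separates the condition from the directive, which mirrors the if-then structure of the English gloss.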

Appendix A provides detailed specifications of all formal meaning representation
languages (MRL) being considered: the Geoquery logical query language,
FunQL, and CLANG.

2 On the other hand, traverse_2 is the inverse of traverse_1, i.e. it denotes the set of rivers
that run through some of the members of its argument (which are usually cities or U.S. states).


2.2  Semantic  Parsing 

Semantic  parsing  is  a  research  area  with  a  long  history.  Many  early  seman¬ 
tic  parsers  are  NL  interfaces  to  databases,  including  Lunar  (Woods  et  al.,  1972), 
Chat-80  (Warren  and  Pereira,  1982),  and  Tina  (Seneff,  1992).  These  NL  inter¬ 
faces  are  often  hand-crafted  for  a  particular  database,  and  cannot  be  easily  ported 
to  other  domains.  Over  the  last  decade,  various  data-driven  approaches  to  seman¬ 
tic  parsing  have  been  proposed.  These  algorithms  often  produce  semantic  parsers 
that  are  more  robust  and  accurate,  and  tend  to  be  less  application-specific  than  their 
hand-crafted  counterparts.  In  this  section,  we  provide  a  brief  overview  of  these 
learning  approaches. 

2.2.1  Syntax-Based  Approaches 

One  of  the  earliest  data-driven  approaches  to  semantic  parsing  is  based  on 
the  idea  of  augmenting  statistical  syntactic  parsers  with  semantic  labels.  Miller  et  al. 
(1994)  propose  the  hierarchical  Hidden  Understanding  Model  (HUM)  in  which 
context-free  grammar  (CFG)  rules  are  learned  from  an  annotated  corpus  consist¬ 
ing  of  augmented  parse  trees.  Figure  2.1  shows  a  sample  augmented  parse  tree  in 
the Atis domain. Here the non-terminal symbols Flight, Stop and City represent
domain-specific concepts, while other non-terminal symbols such as NP (noun
phrase) and VP (verb phrase) are syntactic categories. Given an input sentence, a
parser  based  on  a  probabilistic  recursive  transition  network  is  used  to  find  the  best 
augmented  parse  tree.  This  tree  is  then  converted  into  a  non-recursive  semantic 
frame using a probabilistic semantic interpretation model (Miller et al., 1996).

Figure 2.1: An augmented parse tree taken from Miller et al. (1994)


Ge and Mooney (2005, 2006) present another algorithm based on augmented
parse trees, called Scissor. It improves on HUM in three respects.
First, it is based on a state-of-the-art statistical lexicalized parser (Bikel, 2004).
Second, it handles meaning representations (MR) that are deeply nested, which
are typical in the Geoquery and RoboCup domains. Third, a discriminative re-ranking
model is used for incorporating non-local features. Again, training requires
fully-annotated augmented parse trees.

The main drawback of HUM and Scissor is that they require augmented
parse trees for training, which are often very difficult to obtain. Zettlemoyer and
Collins (2005) address this problem by treating parse trees as hidden variables
which must be estimated using expectation-maximization (EM). Their method is
based on a combinatory categorial grammar (CCG) (Steedman, 2000). The key
idea is to first over-generate a CCG lexicon using a small set of language-specific
template rules. For example, consider the following template rule:

Input trigger: any binary predicate p
Output category: (S\NP)/NP : λx1.λx2.p(x2, x1)

Suppose we are given a training sentence, Utah borders Idaho, and its logical form,
borders(utah, idaho). The binary predicate borders would trigger the
above template rule, producing a lexical item for each word in the sentence:

Utah := (S\NP)/NP : λx1.λx2.borders(x2, x1)
borders := (S\NP)/NP : λx1.λx2.borders(x2, x1)
Idaho := (S\NP)/NP : λx1.λx2.borders(x2, x1)

Next,  spurious  lexical  items  such  as  Utah  and  Idaho  are  pruned  away  during  the 
parameter  estimation  phase,  where  log-linear  parameters  are  learned.  A  later  ver¬ 
sion  of  this  work  (Zettlemoyer  and  Collins,  2007)  uses  a  relaxed  CCG  for  dealing 
with  flexible  word  order  and  other  speech-related  phenomena,  as  exemplified  by  the 
Atis  domain.  Note  that  both  CCG-based  algorithms  require  prior  knowledge  of  the 
NL  syntax  in  the  form  of  template  rules  for  training. 
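The over-generation step for this one template rule can be sketched as follows. This is a simplified illustration rather than the procedure of Zettlemoyer and Collins (2005): the function names are hypothetical, only the single binary-predicate template described above is modeled, and the subsequent pruning by parameter estimation is not shown.

```python
def binary_predicate_template(pred):
    # Template rule: a binary predicate p yields the category (S\NP)/NP
    # with semantics λx1.λx2.p(x2, x1).
    return (r"(S\NP)/NP", f"λx1.λx2.{pred}(x2, x1)")

def overgenerate(sentence, binary_predicates):
    """Pair every word with every lexical entry triggered by the logical form."""
    lexicon = []
    for word in sentence.split():
        for pred in binary_predicates:
            category, semantics = binary_predicate_template(pred)
            lexicon.append((word, category, semantics))
    return lexicon

# Training pair: "Utah borders Idaho" with logical form borders(utah, idaho).
lexicon = overgenerate("Utah borders Idaho", ["borders"])
for word, category, semantics in lexicon:
    print(f"{word} := {category} : {semantics}")
```

The output contains the correct entry for borders together with the spurious entries for Utah and Idaho, which is why a later pruning phase is needed.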

2.2.2  Semantic  Grammars 

A  common  feature  of  syntax-based  approaches  is  to  generate  full  syntactic 
parse  trees  together  with  semantic  parses.  This  is  often  a  more  elaborate  struc¬ 
ture  than  needed.  One  way  to  simplify  the  output  is  to  remove  syntactic  labels 
from  parse  trees.  This  results  in  a  semantic  grammar  (Allen,  1995),  in  which  non¬ 
terminal  symbols  correspond  to  domain-specific  concepts  as  opposed  to  syntactic 
categories. A sample semantic parse tree is shown in Figure 2.2.

Figure 2.2: A semantic parse tree for the sentence in Figure 2.1


Several algorithms for learning semantic grammars have been devised. Kate
et al. (2005) present a bottom-up learning algorithm called Silt. The key idea is
to re-use the non-terminal symbols provided by a domain-specific MRL grammar
(see Appendix A). Each production in the MRL grammar corresponds to a domain-specific
concept. Given a training set consisting of NL sentences and their correct
MRs, context-free parsing rules are learned for each concept, starting with rules
that appear in the leaves of a semantic parse (e.g. City → Pittsburgh), followed
by rules that appear one level higher (e.g. Stop → stop in City), and so on. The
result is a semantic grammar that covers the training set.

More recently, Kate and Mooney (2006) present an algorithm called Krisp
based on string kernels. Instead of learning individual context-free parsing rules for
each domain-specific concept, Krisp learns a support vector machine (SVM) classifier
with string kernels (Lodhi et al., 2002). The kernel-based classifier essentially
assigns weights to all possible word subsequences up to a certain length, so that subsequences
correlated with the specific concept receive higher weights. The learned
model is thus equivalent to a weighted semantic grammar with many context-free
parsing rules. It is shown that Krisp is more robust than other semantic parsers in
the face of noisy input sentences.

In Chapters 3 and 4, we will introduce two semantic parsing algorithms,
Wasp and λ-Wasp, which learn semantic grammars from annotated corpora using
statistical machine translation techniques.

2.2.3  Other  Approaches 

Various  other  learning  approaches  have  been  proposed  for  semantic  parsing. 
Kuhn and De Mori (1995) introduce a system called Chanel that translates NL
queries  into  SQL  based  on  classifications  given  by  learned  decision  trees.  Each 
decision  tree  decides  whether  to  include  a  particular  attribute  or  constraint  in  the 
output  SQL  query.  Chanel  has  been  deployed  in  the  Atis  domain  where  queries 
are  often  conceptually  simple. 

Zelle  and  Mooney  (1996)  present  a  system  called  Chill  which  is  based 
on  inductive  logic  programming  (ILP).  It  learns  a  deterministic  shift-reduce  parser 
from an annotated corpus given a bilingual lexicon, which can be either hand-crafted
or automatically acquired (Thompson and Mooney, 1999). Cocktail
(Tang and Mooney, 2001) is an extension of Chill that shows better coverage
through the use of multiple clause constructors.

Papineni et al. (1997) and Macherey et al. (2001) present two semantic parsing
algorithms based on machine translation. Both algorithms translate English Atis
queries into formal queries as if the target language were a natural language. The
algorithm of Papineni et al. (1997) is based on a discriminatively-trained, word-based
translation model (Section 2.5.1), while that of Macherey et al. (2001) is based on a
phrase-based translation model (Section 2.5.2). Unlike these algorithms, our Wasp and λ-Wasp
algorithms are based on syntax-based translation models (Section 2.5.2).

He  and  Young  (2003,  2006)  propose  the  Hidden  Vector  State  (HVS)  model, 
which is an extension of the hidden Markov model (HMM) with stack-oriented state
vectors. It can capture the hierarchical structure of sentences, while being more
constrained  than  CFGs.  It  has  been  deployed  in  various  SLU  systems  including 
Atis,  and  is  shown  to  be  quite  robust  to  input  noise. 

Wang and Acero (2003) propose an extended HMM model for the Atis domain,
where a multiple-word segment is generated from each underlying Markov
state that corresponds to a domain-specific semantic slot. These segments correspond
to slot fillers such as dates and times, for which CFGs are written. Then a
learned HMM serves to glue together different slot fillers to form a complete semantic
interpretation.

Lastly, Precise (Popescu et al., 2003, 2004) is a knowledge-intensive approach
to semantic parsing that does not involve any learning. It introduces the
notion of semantically tractable sentences, i.e. sentences that give rise to a unique
semantic interpretation given a hand-crafted lexicon and a set of semantic constraints.
Interestingly, Popescu et al. (2004) show that over 90% of the context-independent
Atis queries are semantically tractable, whereas only 80% of the Geoquery
queries are semantically tractable, which suggests that Geoquery is indeed a more
challenging domain than Atis.

Note  that  none  of  the  above  systems  can  be  easily  adapted  for  the  inverse 
task of tactical generation. In Chapters 5 and 6, we will show that the Wasp and
λ-Wasp semantic parsing algorithms (Chapters 3 and 4) can be readily inverted to
produce  effective  tactical  generators. 

2.3  Natural  Language  Generation 

This section provides a brief summary of data-driven approaches to natural
language generation (NLG). More specifically, we focus on tactical generation,
which is the generation of NL sentences from formal, symbolic MRs.

Early  tactical  generation  systems,  such  as  PENMAN  (Bateman,  1990),  SURGE 
(Elhadad  and  Robin,  1996),  and  RealPro  (Lavoie  and  Rambow,  1997),  typically 
depend  on  large-scale  knowledge  bases  that  are  built  by  hand.  These  systems  are 
often  too  fragile  for  general  use  due  to  knowledge  gaps  in  the  hand-built  grammars 
and  lexicons. 

To  improve  robustness,  Knight  and  Hatzivassiloglou  (1995)  introduce  a  two- 
level  architecture  in  which  a  statistical  n-gram  language  model  is  used  to  rank  the 
output of a knowledge-based generator. The reason for improved robustness is twofold:
First, when dealing with new constructions, the knowledge-based system can
freely overgenerate, and let the language model make its selections. This simplifies
the construction of knowledge bases. Second, when faced with incomplete or
underspecified input (e.g. from semantic parsers), the language model can help fill in
the  missing  pieces  based  on  fluency. 
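
The overgenerate-and-rank idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy corpus, and the choice of an add-alpha smoothed bigram model are not taken from Knight and Hatzivassiloglou (1995), whose system uses its own n-gram model and knowledge-based generator.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigram contexts and bigrams over sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of one candidate sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

def rank_candidates(candidates, corpus):
    """Return the candidate that the language model considers most fluent."""
    unigrams, bigrams = train_bigram(corpus)
    vocab = {w for s in corpus for w in s.split()} | {"</s>"}
    return max(candidates,
               key=lambda c: log_prob(c, unigrams, bigrams, len(vocab)))

# Hypothetical toy data: the "generator" overgenerates three word orders.
corpus = ["the dog barks", "the cat sleeps", "a dog sleeps"]
candidates = ["the dog sleeps", "dog the sleeps", "sleeps the dog"]
best = rank_candidates(candidates, corpus)
```

Because the ranker only needs a list of candidate strings, the knowledge-based component can stay permissive and leave the final choice to the language model.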

Many  subsequent  NLG  systems  follow  the  same  overall  architecture.  For 
example,  Nitrogen  (Langkilde  and  Knight,  1998)  is  an  NLG  system  similar  to 
Knight  and  Hatzivassiloglou  (1995),  but  with  a  more  efficient  knowledge-based 
component  that  operates  bottom-up  rather  than  top-down.  Again,  a  statistical  n- 
gram  ranker  is  used  to  extract  the  best  output  sentence  from  a  set  of  candidates. 
HALogen  (Langkilde-Geary,  2002)  is  a  successor  to  Nitrogen,  which  includes 
a  knowledge  base  that  provides  better  coverage  of  English  syntax. 

Fergus  (Bangalore  et  al.,  2000)  is  an  NLG  system  based  on  the  XTAG 
grammar  (XTAG  Research  Group,  2001).  Given  an  input  dependency  tree  whose 
nodes  are  unordered  and  are  labeled  only  with  lexemes,  a  statistical  tree  model  is 
used  to  assign  the  best  elementary  tree  for  each  lexeme.  Then  a  word  lattice  that 
encodes  all  possible  surface  strings  permitted  by  the  elementary  trees  is  formed. 
A trigram language model trained on the Wall Street Journal (WSJ) corpus is then
used  to  rank  the  candidate  strings. 

Amalgam  (Corston-Oliver  et  al.,  2002;  Ringger  et  al.,  2004)  is  an  NLG 
system  for  French  and  German  in  which  the  mapping  from  underspecified  to  fully- 
specified  dependency  parses  is  mostly  guided  by  learned  decision  tree  classifiers. 
These classifiers insert function words, determine verb positions, re-attach nodes
for raising and wh-movement, and so forth. These classifiers are trained on the
output of hand-crafted, broad-coverage parsers. Hand-built classifiers are used
whenever there is insufficient training data. A statistical language model is then used to
determine  the  relative  order  of  constituents  in  a  dependency  parse. 

2.3.1  Chart  Generation 

The  XTAG  grammar  used  by  Fergus  is  a  bidirectional  (or  reversible) 
grammar  that  has  been  used  for  parsing  as  well  (Schabes  and  Joshi,  1988).  The 
use  of  a  single  grammar  for  both  parsing  and  generation  has  been  widely  advocated 
for  its  elegance.  Kay’s  (1975)  research  into  functional  grammar  is  motivated  by  the 
desire to “make it possible to generate and analyze sentences with the same
grammar”. Jacobs (1985) presents an early implementation of this idea. His Phred
generator  operates  from  the  same  declarative  knowledge  base  used  by  PHRAN,  a 
sentence  analyzer  (Wilensky  and  Arens,  1980).  Other  early  NLP  systems  share  at 
least  part  of  the  linguistic  knowledge  for  parsing  and  generation  (Steinacker  and 
Buchberger,  1983;  Wahlster  et  al.,  1983). 

Shieber (1988) notes that not only can a single grammar be used for parsing
and generation, but the same language-processing architecture can also be used for
processing the grammar in both directions. He suggests that charts can be a natural
uniform architecture for efficient parsing and generation. This is in marked contrast
to previous systems (e.g. Phran and Phred) where the parsing and generation
algorithms are often radically different. Kay (1996) further refines this idea, pointing
out that chart generation is similar to chart parsing with free word order, because in
logical forms, the relative order of predicates is immaterial.

These observations have led to the development of a number of chart generators.
Carroll et al. (1999) introduce an efficient bottom-up chart generator for
head-driven phrase structure grammars (HPSG). Constructions such as intersective
modification (e.g. a tall young Polish athlete) are treated in a separate phase because
chart generation can be exponential in these cases. Carroll and Oepen (2005)
further introduce a procedure to selectively unpack a derivation forest based on a
probabilistic model, which is a combination of a 4-gram language model and a
maximum-entropy model whose feature types correspond to sub-trees of derivations
(Velldal and Oepen, 2005).

White  and  Baldridge  (2003)  present  a  chart  generator  adapted  for  use  with 
CCG.  A  major  strength  of  the  CCG  generator  is  its  ability  to  generate  a  wide  range 
of  coordination  phenomena  efficiently,  including  argument  cluster  coordination.  A 
statistical n-gram language model is used to rank candidate surface strings (White,
2004). 

Nakanishi  et  al.  (2005)  present  a  similar  probabilistic  chart  generator  based 
on  the  Enju  grammar,  an  English  HPSG  grammar  extracted  from  the  Penn  Treebank 
(Miyao  et  al.,  2004).  The  probabilistic  model  is  a  log-linear  model  with  a  variety  of 
n-gram  features  and  syntactic  features. 

Despite their use of statistical models, all of the above algorithms rely on
manually-constructed knowledge bases or grammars which are difficult to maintain.
Moreover, they focus on the task of surface realization, i.e. linearizing and
inflecting words in a sentence, requiring extensive lexical information (e.g. lexemes)
in the input logical forms. The mapping from predicates to lexemes is then
relegated to a separate sentence planning component. In Chapters 5 and 6, we will
introduce  tactical  generation  algorithms  that  learn  all  of  their  linguistic  knowledge 
from  annotated  corpora,  and  show  that  surface  realization  and  lexical  selection  can 
be  integrated  in  an  elegant  framework  based  on  synchronous  parsing. 

2.4  Synchronous  Parsing 

In this section, we define the notion of synchronous parsing. Originally
introduced by Aho and Ullman (1969, 1972) to model the compilation of high-level
programming languages into machine code, it has recently been used in various
NLP tasks that involve language translation, such as machine translation (Wu, 1997;
Yamada and Knight, 2001; Chiang, 2005; Galley et al., 2006), textual entailment
(Wu, 2005), sentence compression (Galley and McKeown, 2007), question answering
(Wang et al., 2007), and syntactic parsing for resource-poor languages (Chiang
et  al.,  2006).  Shieber  and  Schabes  (1990a, b)  propose  that  synchronous  parsing  can 
be  used  for  semantic  parsing  and  natural  language  generation  as  well. 

Synchronous  parsing  differs  from  ordinary  parsing  in  that  a  derivation  yields 
a  pair  of  strings  (or  trees).  To  finitely  specify  a  potentially  infinite  set  of  string  pairs 
(or tree pairs), we use a synchronous grammar. Many types of synchronous grammars
have been proposed for NLP, including synchronous context-free grammars
(Aho and Ullman, 1972), synchronous tree-adjoining grammars (Shieber and Schabes,
1990b), synchronous tree-substitution grammars (Yamada and Knight, 2001),
and quasi-synchronous grammars (Smith and Eisner, 2006). In the next subsection,
we will illustrate synchronous parsing using synchronous context-free grammars
(SCFG).

2.4.1 Synchronous Context-Free Grammars

An SCFG is defined by a 5-tuple:

$$G = \langle N, T_e, T_f, \mathcal{L}, S \rangle \tag{2.1}$$

where $N$ is a finite set of non-terminal symbols, $T_e$ is a finite set of terminal symbols
for the input language, $T_f$ is a finite set of terminal symbols for the output
language, $\mathcal{L}$ is a lexicon consisting of a finite set of production rules, and $S \in N$ is
a distinguished start symbol. Each production rule in $\mathcal{L}$ takes the following form:

$$A \rightarrow \langle \alpha, \beta \rangle \tag{2.2}$$

where $A \in N$, $\alpha \in (N \cup T_e)^+$, and $\beta \in (N \cup T_f)^+$. The non-terminal $A$ is called
the left-hand side (LHS) of the production rule. The right-hand side (RHS) of the
production rule is a pair of strings, $\langle \alpha, \beta \rangle$. For each non-terminal in $\alpha$, there is an
associated, identical non-terminal in $\beta$. In other words, the non-terminals in $\alpha$ are
a permutation of the non-terminals in $\beta$. We use indices $\boxed{1}, \boxed{2}, \ldots$ to indicate the
association. For example, in the production rule $A \rightarrow \langle B_{\boxed{1}} B_{\boxed{2}},\ B_{\boxed{2}} B_{\boxed{1}} \rangle$, the first
$B$ non-terminal in $B_{\boxed{1}} B_{\boxed{2}}$ is associated with the second $B$ non-terminal in $B_{\boxed{2}} B_{\boxed{1}}$.

Given an SCFG, $G$, we define a translation form as follows:

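1. $\langle S_{\boxed{1}}, S_{\boxed{1}} \rangle$ is a translation form.

2. If $\langle \alpha A_{\boxed{i}} \beta,\ \alpha' A_{\boxed{i}} \beta' \rangle$ is a translation form, and if $A \rightarrow \langle \gamma, \gamma' \rangle$ is a production
rule in $\mathcal{L}$, then $\langle \alpha \gamma \beta,\ \alpha' \gamma' \beta' \rangle$ is also a translation form. For this, we write:

$$\langle \alpha A_{\boxed{i}} \beta,\ \alpha' A_{\boxed{i}} \beta' \rangle \Rightarrow_G \langle \alpha \gamma \beta,\ \alpha' \gamma' \beta' \rangle$$

The non-terminals $A_{\boxed{i}}$ are said to be rewritten by the production rule $A \rightarrow \langle \gamma, \gamma' \rangle$.

A derivation under $G$ is a sequence of translation forms:

$$\langle S_{\boxed{1}}, S_{\boxed{1}} \rangle \Rightarrow_G \langle \alpha_1, \beta_1 \rangle \Rightarrow_G \cdots \Rightarrow_G \langle \alpha_k, \beta_k \rangle$$

such that $\alpha_k \in T_e^+$ and $\beta_k \in T_f^+$. The string pair $\langle \alpha_k, \beta_k \rangle$ is said to be the yield of
the derivation, and $\beta_k$ is said to be a translation of $\alpha_k$, and vice versa.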

We further define the input grammar of $G$ as the 4-tuple $G_e = \langle N, T_e, \mathcal{L}_e, S \rangle$,
where $\mathcal{L}_e = \{A \rightarrow \alpha \mid A \rightarrow \langle \alpha, \beta \rangle \in \mathcal{L}\}$. Similarly, the output grammar of $G$ is
defined as the 4-tuple $G_f = \langle N, T_f, \mathcal{L}_f, S \rangle$, where $\mathcal{L}_f = \{A \rightarrow \beta \mid A \rightarrow \langle \alpha, \beta \rangle \in \mathcal{L}\}$.
Both $G_e$ and $G_f$ are context-free grammars (CFG). We can then view synchronous
parsing as a process in which two CFG parse trees are generated simultaneously,
one based on the input grammar, and the other based on the output grammar.
Furthermore, the two parse trees are isomorphic, since there is a one-to-one mapping
between the non-terminal nodes in the two parse trees.

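The language translation task can be formulated as follows: Given an input
string $x$, we find a derivation under $G_e$ that is consistent with $x$ (if any):

$$S \Rightarrow_{G_e} \alpha_1 \Rightarrow_{G_e} \cdots \Rightarrow_{G_e} x$$

This derivation corresponds to the following derivation under $G$:

$$\langle S_{\boxed{1}}, S_{\boxed{1}} \rangle \Rightarrow_G \langle \alpha_1, \beta_1 \rangle \Rightarrow_G \cdots \Rightarrow_G \langle x, y \rangle$$

The string $y$ is then a translation of $x$.

As a concrete example, suppose that $G$ is the following:

$$
\begin{aligned}
N &= \{\text{S}, \text{NP}, \text{VP}\} \\
T_e &= \{\textit{wo}, \textit{shui guo}, \textit{xi huan}\} \\
T_f &= \{\textit{I}, \textit{fruits}, \textit{like}\} \\
\mathcal{L} &= \{\ \text{S} \rightarrow \langle \text{NP}_{\boxed{1}}\, \text{VP}_{\boxed{2}},\ \text{NP}_{\boxed{1}}\, \text{VP}_{\boxed{2}} \rangle, \\
&\qquad \text{NP} \rightarrow \langle \textit{wo},\ \textit{I} \rangle, \\
&\qquad \text{NP} \rightarrow \langle \textit{shui guo},\ \textit{fruits} \rangle, \\
&\qquad \text{VP} \rightarrow \langle \textit{xi huan}\ \text{NP}_{\boxed{1}},\ \textit{like}\ \text{NP}_{\boxed{1}} \rangle\ \} \\
S &= \text{S}
\end{aligned}
$$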

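Given an input string, wo xi huan shui guo, a derivation under $G$ that is consistent
with the input string would be:

$$
\begin{aligned}
\langle \text{S}_{\boxed{1}},\ \text{S}_{\boxed{1}} \rangle &\Rightarrow_G \langle \text{NP}_{\boxed{1}}\, \text{VP}_{\boxed{2}},\ \text{NP}_{\boxed{1}}\, \text{VP}_{\boxed{2}} \rangle \\
&\Rightarrow_G \langle \textit{wo}\ \text{VP}_{\boxed{1}},\ \textit{I}\ \text{VP}_{\boxed{1}} \rangle \\
&\Rightarrow_G \langle \textit{wo xi huan}\ \text{NP}_{\boxed{1}},\ \textit{I like}\ \text{NP}_{\boxed{1}} \rangle \\
&\Rightarrow_G \langle \textit{wo xi huan shui guo},\ \textit{I like fruits} \rangle
\end{aligned}
$$

Based on this derivation, a translation of wo xi huan shui guo would be I like fruits.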
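
The toy grammar can also be run as a small program. The sketch below is only illustrative and is not the parsing algorithm used later in this thesis: it is a naive top-down matcher that assumes a grammar without left recursion, it tokenizes the multi-word terminals shui guo and xi huan on whitespace, and the names `RULES`, `parse`, `match` and `translate` are hypothetical.

```python
# Each rule is (LHS, input side, output side). Non-terminals are written as
# (symbol, index) tuples; terminals are plain strings. Equal indices on the
# two sides mark associated non-terminals.
RULES = [
    ("S",  [("NP", 1), ("VP", 2)],    [("NP", 1), ("VP", 2)]),
    ("NP", ["wo"],                    ["I"]),
    ("NP", ["shui", "guo"],           ["fruits"]),
    ("VP", ["xi", "huan", ("NP", 1)], ["like", ("NP", 1)]),
]

def parse(symbol, words, start):
    """Yield (end, output_words) for every way `symbol` derives words[start:end]."""
    for lhs, alpha, beta in RULES:
        if lhs != symbol:
            continue
        for end, bindings in match(alpha, words, start, {}):
            output = []
            for item in beta:
                if isinstance(item, tuple):
                    # Splice in the translation of the associated non-terminal.
                    output.extend(bindings[item[1]])
                else:
                    output.append(item)
            yield end, output

def match(alpha, words, pos, bindings):
    """Match the input side left to right, recording translations by index."""
    if not alpha:
        yield pos, bindings
        return
    head, rest = alpha[0], alpha[1:]
    if isinstance(head, tuple):
        for end, output in parse(head[0], words, pos):
            yield from match(rest, words, end, {**bindings, head[1]: output})
    elif pos < len(words) and words[pos] == head:
        yield from match(rest, words, pos + 1, bindings)

def translate(sentence, start_symbol="S"):
    """Return every output string whose derivation covers the whole input."""
    words = sentence.split()
    return [" ".join(output)
            for end, output in parse(start_symbol, words, 0)
            if end == len(words)]

translations = translate("wo xi huan shui guo")
```

The input side of each rule drives the search, and the output side is assembled from the translations bound to the shared indices, which mirrors the view of synchronous parsing as building two isomorphic parse trees at once.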

Synchronous  grammars  provide  a  natural  way  of  capturing  the  hierarchical 
structures  of  a  sentence  and  its  translation,  as  well  as  the  correspondence  between 
their sub-parts. In Chapters 3-6, we will introduce algorithms for learning synchronous
grammars such as SCFGs for both semantic parsing and tactical generation.

2.5  Statistical  Machine  Translation 

Another  area  of  research  that  is  relevant  to  our  work  is  machine  translation, 
whose main goal is to translate one natural language into another. Machine
translation (MT) is a particularly challenging task, because of the inherent ambiguity
of  natural  languages  on  both  sides.  It  has  inspired  a  large  body  of  research.  In 
particular,  the  growing  availability  of  parallel  corpora,  in  which  the  same  content 
is  available  in  multiple  languages,  has  stimulated  interest  in  statistical  methods  for 
extracting  linguistic  knowledge  from  a  large  body  of  text.  In  this  section,  we  review 
the  main  components  of  a  typical  statistical  MT  system. 

Without loss of generality, we define machine translation as the task of translating
a foreign sentence, f, into an English sentence, e. Obviously, there are many
acceptable  translations  for  a  given  f .  In  statistical  MT,  every  English  sentence  is  a 
possible  translation  of  f.  Each  English  sentence  e  is  assigned  a  probability  Pr(e|f). 
The  task  of  translating  a  foreign  sentence,  f ,  is  then  to  choose  the  English  sentence, 
e*,  for  which  Pr(e*|f)  is  the  greatest.  Traditionally,  this  task  is  divided  into  several 
more  manageable  sub-tasks,  e.g.: 

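$$\mathbf{e}^* = \arg\max_{\mathbf{e}} \Pr(\mathbf{e} \mid \mathbf{f}) = \arg\max_{\mathbf{e}} \Pr(\mathbf{e}) \Pr(\mathbf{f} \mid \mathbf{e}) \tag{2.3}$$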

In this noisy-channel framework, the translation task is to find an English translation,
e*, such that (1) it is a well-formed English sentence, and (2) it explains f well.
Pr(e) is traditionally called a language model, and Pr(f|e) a translation model. The
language modeling problem is essentially the same as in automatic speech recognition,
where n-gram models are commonly used (Stolcke, 2002; Brants et al., 2007).
On  the  other  hand,  translation  models  are  unique  to  statistical  MT,  and  will  be  the 
main  focus  of  the  following  subsections. 
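
As a minimal sketch of the noisy-channel decision rule in (2.3), the code below scores a finite list of candidate translations in log space. The candidate list and all probabilities are invented for illustration; a real system searches a far larger space and estimates both models from data.

```python
import math

def decode(f, candidates, language_model, translation_model):
    """Return the candidate e maximizing Pr(e) * Pr(f | e), computed in log space."""
    def score(e):
        return math.log(language_model[e]) + math.log(translation_model[(f, e)])
    return max(candidates, key=score)

# Hypothetical toy probabilities (assumptions, not estimates from any corpus).
foreign = "wo xi huan shui guo"
language_model = {            # Pr(e): how well-formed e is as English
    "I like fruits": 0.020,
    "I fruits like": 0.001,
    "me like fruits": 0.002,
}
translation_model = {         # Pr(f | e): how well e explains f
    (foreign, "I like fruits"): 0.30,
    (foreign, "I fruits like"): 0.40,
    (foreign, "me like fruits"): 0.30,
}

best = decode(foreign, list(language_model), language_model, translation_model)
```

In this toy setting the translation model slightly prefers the ill-formed word order, and the language model overrides it, which is the division of labor that the noisy-channel framework relies on.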

2.5.1  Word-Based  Translation  Models 

[Figure 2.3: A word alignment taken from Brown et al. (1993b). Figure not reproduced; its English side reads "And the program has been implemented".]

Brown et al. (1993b) present a series of five translation models which later
became known as the IBM Models. These models are word-based because they
model how individual words in e are translated into words in f. Such word-to-word
mappings are captured in a word alignment (Brown et al., 1990). Suppose that
$\mathbf{e} = e_1^I = \langle e_1, \ldots, e_I \rangle$, and $\mathbf{f} = f_1^J = \langle f_1, \ldots, f_J \rangle$. A word alignment, $\mathbf{a}$, between
e and f is defined as:

$$\mathbf{a} = \langle a_1, \ldots, a_J \rangle \quad \text{where } 0 \le a_j \le I \text{ for all } j = 1, \ldots, J \tag{2.4}$$

where $a_j$ is the position of the English word that the foreign word $f_j$ is linked to.
If $a_j = 0$, then $f_j$ is not linked to any English word. Note that in the IBM Models,
word alignments are constrained to be 1-to-n, i.e. each foreign word is linked to at
most one English word. Figure 2.3 shows a sample word alignment for an English-French
sentence pair. In this word alignment, the French word le is linked to the
English word the, the French phrase mis en application as a whole is linked to the
English word implemented, and so on.

The translation model Pr(f|e) is then expressed as a sum of the probabilities
of word alignments a between e and f:

$$\Pr(\mathbf{f} \mid \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) \tag{2.5}$$

The word alignments a are hidden variables which must be estimated using EM.
Hence Pr(f|e) is also called a hidden alignment model (or word alignment model).
The IBM Models mainly differ in terms of the formulation of Pr(f, a|e). In IBM
Models 1 and 2, this probability is formulated as:

$$\Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \Pr(J \mid \mathbf{e}) \prod_{j=1}^{J} \Pr(a_j \mid j, I, J) \Pr(f_j \mid e_{a_j}) \tag{2.6}$$

The generative process for producing f from e is as follows: Given an English
sentence, e, choose a length $J$ for f. Then for each foreign word position, $j$, choose
$a_j$ from $0, 1, \ldots, I$, and also $f_j$ based on the English word $e_{a_j}$. Various simplifying
assumptions are made so that inference remains tractable. In particular, a zero-order
assumption is made such that the choice of $a_j$ is independent of $a_1^{j-1}$, i.e. all word
movements are independent.
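
To make the role of EM concrete, here is a minimal sketch of training the lexical translation probabilities Pr(f_j | e_{a_j}) for IBM Model 1. It assumes the usual Model 1 simplifications, namely that the alignment probability Pr(a_j | j, I, J) is uniform and that Pr(J | e) is a constant, so both cancel in the posterior. The function name, the `<null>` token standing for position 0, and the toy sentence pairs are assumptions made for illustration.

```python
from collections import defaultdict

NULL = "<null>"  # stands for position 0: a foreign word linked to no English word

def train_ibm_model1(pairs, iterations=10):
    """EM for t(f | e). `pairs` is a list of (english_words, foreign_words)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform table over all co-occurring word pairs.
    t = {}
    for es, fs in pairs:
        for e in [NULL] + es:
            for f in fs:
                t[(f, e)] = 1.0 / len(f_vocab)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f being linked to e
        total = defaultdict(float)   # expected count of e being linked to anything
        for es, fs in pairs:
            es_null = [NULL] + es
            for f in fs:
                # E-step: under the zero-order assumption the posterior over
                # a_j factorizes per foreign word.
                norm = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    posterior = t[(f, e)] / norm
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: renormalize the expected counts for each English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Hypothetical toy parallel corpus.
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t = train_ibm_model1(pairs)
```

Because each foreign word chooses its link independently, the sum over all alignments in (2.5) reduces to a per-word normalization, which is why exact inference remains tractable for Models 1 and 2.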

The  zero-order  assumption  of  IBM  Models  1  and  2  is  unrealistic,  as  it  does 
not  take  collocations  into  account,  such  as  mis  en  application.  In  the  subsequent 
IBM  Models,  this  assumption  is  gradually  relaxed,  so  that  collocations  can  be  better 
modeled.  Exact  inference  is  no  longer  tractable,  so  approximate  inference  must  be 
used.  Due  to  the  complexity  of  these  models,  we  will  not  discuss  them  in  detail. 

Word alignment models such as IBM Models 1-5 are widely used in working
with parallel corpora. Among the applications are extracting parallel sentences
from comparable corpora (Munteanu et al., 2004), aligning dependency-tree fragments
(Ding et al., 2003), and extracting translation pairs for phrase-based and
syntax-based translation models (Och and Ney, 2004; Chiang, 2005). In Chapters
3 and 4, we will show that word alignment models can be used for extracting
synchronous grammar rules for semantic parsing as well.

2.5.2 Phrase-Based and Syntax-Based Translation Models

A  major  problem  with  the  IBM  Models  is  their  lack  of  linguistic  content. 
One approach to this problem is to introduce the concept of phrases in a phrase-based
translation model. A basic phrase-based model translates e into f in the
following steps: First, e is segmented into a number of sequences of consecutive
words (or phrases), $\bar{e}_1, \ldots, \bar{e}_K$. These phrases are then reordered and translated into
foreign phrases, $\bar{f}_1, \ldots, \bar{f}_K$, which are joined together to form a foreign sentence, f.
Och et al. (1999) introduce an alignment template approach in which phrase pairs,
$\{\langle \bar{e}, \bar{f} \rangle\}$, are extracted from word alignments. The aligned phrase pairs are then
generalized  to  form  alignment  templates,  based  on  word  classes  learned  from  the 
training  data.  In  Koehn  et  al.  (2003),  Tillmann  (2003)  and  Venugopal  et  al.  (2003), 
phrase  pairs  are  extracted  from  word  alignments  without  generalization.  In  Marcu 
and  Wong  (2002),  phrase  translations  are  learned  as  part  of  an  EM  algorithm  in 
which  the  joint  probability  Pr(e,  f )  is  estimated. 
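
Extraction of phrase pairs from a word alignment can be sketched as follows. The sketch uses the commonly used consistency criterion, under which a phrase pair is kept only if no word inside it is linked to a word outside it. It is a simplified illustration, not the exact procedure of any of the cited systems: it does not extend phrases over unaligned words, and the function name, the 0-based link format and the toy alignment are assumptions.

```python
def extract_phrase_pairs(e_words, f_words, links, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    `links` is a set of (i, j) pairs meaning e_words[i] is linked to
    f_words[j], both 0-based.
    """
    pairs = set()
    for i1 in range(len(e_words)):
        for i2 in range(i1, min(i1 + max_len, len(e_words))):
            # Foreign positions linked to the English span [i1, i2].
            js = [j for (i, j) in links if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: nothing in the foreign span links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in links if j1 <= j <= j2):
                pairs.add((" ".join(e_words[i1:i2 + 1]),
                           " ".join(f_words[j1:j2 + 1])))
    return pairs

# Hypothetical alignment for the toy sentence pair used in Section 2.4.1.
e_words = "I like fruits".split()
f_words = "wo xi huan shui guo".split()
links = {(0, 0), (1, 1), (1, 2), (2, 3), (2, 4)}
phrase_pairs = extract_phrase_pairs(e_words, f_words, links)
```

With the default `max_len=4` the whole-sentence pair is excluded because its foreign side has five words; raising the limit to 5 admits it.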

Phrase-based translation models can be further generalized to handle hierarchical
phrasal structures. Such models are collectively known as syntax-based
translation models. Yamada and Knight (2001, 2002) present a tree-to-string translation
model based on a synchronous tree-substitution grammar (Knight and Graehl,
2005). Galley et al. (2006) extend the tree-to-string model with multi-level syntactic
translation rules. Chiang (2005) presents a hierarchical phrase-based model
whose underlying formalism is an SCFG. Both Galley et al.’s (2006) and Chiang’s
(2005) systems are shown to outperform state-of-the-art phrase-based MT systems.

A  common  feature  of  syntax-based  translation  models  is  that  they  are  all 
based  on  synchronous  grammars.  Synchronous  grammars  are  ideal  formalisms  for 
formulating  syntax-based  translation  models  because  they  describe  not  only  the 
hierarchical  structures  of  a  sentence  pair,  but  also  the  correspondence  between  their 
sub-parts. In subsequent chapters, we will show that learning techniques developed
for syntax-based statistical MT can be brought to bear on tasks that involve formal
MRLs, such as semantic parsing and tactical generation.

Chapter 3


Semantic  Parsing  with  Machine  Translation 


This  chapter  describes  how  semantic  parsing  can  be  done  using  statistical 
machine  translation  (Wong  and  Mooney,  2006).  Specifically,  the  parsing  model 
can  be  seen  as  a  syntax-based  translation  model,  and  word  alignments  are  used  in 
lexical  acquisition.  Our  algorithm  is  called  Wasp,  short  for  Word  Alignment-based 
Semantic  Parsing.  In  this  chapter,  we  focus  on  variable-free  MRLs  such  as  FunQL 
and  CLANG  (Section  2.1).  A  variation  of  Wasp  that  handles  logical  forms  will  be 
described  in  Chapter  4.  The  Wasp  algorithm  will  also  form  the  basis  of  our  tactical 
generation algorithm, Wasp$^{-1}$, and its variants (Chapters 5 and 6).

3.1  Motivation 

As mentioned in Section 2.2, prior research on semantic parsing has mainly
focused on relatively simple domains such as Atis (Section 2.1), where a typical
sentence can be represented by a single semantic frame. Learning methods
have been devised that can handle MRs with a complex, nested structure as in the
Geoquery and RoboCup domains. However, some of these methods are based
on deterministic parsing (Zelle and Mooney, 1996; Tang and Mooney, 2001; Kate
et al., 2005), which lacks the robustness that characterizes recent advances in statistical
NLP. Other methods involve the use of fully-annotated semantically-augmented
parse trees (Ge and Mooney, 2005) or prior knowledge of the NL syntax (Bos,
2005; Zettlemoyer and Collins, 2005, 2007) in training, and hence require extensive
human expertise when porting to a new language or domain.

In this work, we treat semantic parsing as a language translation task. Sentences
are translated into formal MRs through synchronous parsing (Section 2.4),
which provides a natural way of capturing the hierarchical structures of NL sentences
and their MRL translations, as well as the correspondence between their
sub-parts.  Originally  developed  as  a  theory  of  compilers  in  which  syntax  analysis 
and  code  generation  are  combined  into  a  single  phase  (Aho  and  Ullman,  1972), 
synchronous  parsing  has  seen  a  surge  of  interest  recently  in  the  machine  translation 
community  as  a  way  of  formalizing  syntax-based  translation  models  (Wu,  1997; 
Chiang,  2005).  We  argue  that  synchronous  parsing  can  also  be  useful  in  translation 
tasks  that  involve  both  natural  and  formal  languages,  and  in  semantic  parsing  in 
particular. 

In subsequent sections, we present a learning algorithm for semantic parsing
called Wasp. The input to the learning algorithm is a set of training sentences
paired with their correct MRs. The output from the learning algorithm is
a synchronous context-free grammar (SCFG), together with parameters that define
a  log-linear  distribution  over  parses  under  the  grammar.  The  learning  algorithm 
assumes  that  an  unambiguous,  context-free  grammar  (CFG)  of  the  target  MRL  is 
available,  but  it  does  not  require  any  prior  knowledge  of  the  NL  syntax  or  annotated 
parse  trees  in  the  training  data.  Experiments  show  that  Wasp  performs  favorably  in 
terms  of  both  accuracy  and  coverage  compared  to  other  methods  requiring  similar 
supervision,  and  is  considerably  more  robust  than  methods  based  on  deterministic 
parsing.

((bowner our {4}) (do our {6} (pos (left (half our)))))
If our player 4 has the ball, then our player 6 should stay in the left side of our half

Figure 3.1: A meaning representation in CLANG and its English gloss


[Figure 3.2: Partial parse trees for the string pair in Figure 3.1. (a) English: Rule → If Condition ...; Condition → Team player Unum has the ball; Team → our; Unum → 4. (b) CLANG: Rule → (Condition ...); Condition → (bowner Team {Unum}); Team → our; Unum → 4.]


3.2  The  Wasp  Algorithm 

To  describe  the  Wasp  semantic  parsing  algorithm,  it  is  best  to  start  with 
an  example.  Consider  the  task  of  translating  the  English  sentence  in  Figure  3.1 
into  its  CLANG  representation  in  the  RoboCup  domain.  To  achieve  this  task,  we 
may first analyze the syntactic structure of the English sentence using a semantic
grammar (Section 2.2.2), whose non-terminals are those in the CLANG grammar.
The meaning of the sentence is then obtained by combining the meanings of its
sub-parts based on the semantic parse. Figure 3.2(a) shows a possible semantic parse of
the  sample  sentence  (the  Unum  non-terminal  in  the  parse  tree  stands  for  “uniform 
number”).  Figure  3.2(b)  shows  the  corresponding  CLANG  parse  tree  from  which 
the  MR  is  constructed. 

This translation process can be formalized as synchronous parsing. A detailed
description of the synchronous parsing framework can be found in Section
2.4. Under this framework, a derivation yields two strings, one for the source NL,
and one for the target MRL. Given an input sentence, e, the task of semantic parsing
is to find a derivation that yields a string pair, $\langle \mathbf{e}, \mathbf{f} \rangle$, so that f is an MRL translation
of e. To finitely specify a potentially infinite set of string pairs, we use a weighted
SCFG, $G$, defined by a 6-tuple:

$$G = \langle N, T_e, T_f, \mathcal{L}, S, \lambda \rangle \tag{3.1}$$

where $N$ is a finite set of non-terminal symbols, $T_e$ is a finite set of NL terminal
symbols (words), $T_f$ is a finite set of MRL terminal symbols, $\mathcal{L}$ is a lexicon which
consists of a finite set of rules$^1$, $S \in N$ is a distinguished start symbol, and $\lambda$ is a set
of parameters that define a probability distribution over derivations under $G$. Each
rule in $\mathcal{L}$ takes the following form:

$$A \rightarrow \langle \alpha, \beta \rangle \tag{3.2}$$

where $A \in N$, $\alpha \in (N \cup T_e)^+$, and $\beta \in (N \cup T_f)^+$. The LHS of the rule is a
non-terminal, $A$. The RHS of the rule is a pair of strings, $\langle \alpha, \beta \rangle$, in which the
non-terminals in $\alpha$ are a permutation of the non-terminals in $\beta$. Below are some SCFG
rules that can be used to produce the parse trees in Figure 3.2:


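$$
\begin{aligned}
\text{Rule} &\rightarrow \langle\, \text{if Condition}_{\boxed{1}} \text{ , Directive}_{\boxed{2}} \text{ . ,}\ \ \texttt{(}\text{Condition}_{\boxed{1}}\ \text{Directive}_{\boxed{2}}\texttt{)} \,\rangle \\
\text{Condition} &\rightarrow \langle\, \text{Team}_{\boxed{1}} \text{ player Unum}_{\boxed{2}} \text{ has (1) ball ,}\ \ \texttt{(bowner }\text{Team}_{\boxed{1}}\ \texttt{\{}\text{Unum}_{\boxed{2}}\texttt{\})} \,\rangle \\
\text{Team} &\rightarrow \langle\, \text{our ,}\ \ \texttt{our} \,\rangle \\
\text{Unum} &\rightarrow \langle\, \text{4 ,}\ \ \texttt{4} \,\rangle
\end{aligned}
$$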


1 Henceforth, we reserve the term rules for production rules of an SCFG, and the term productions
for production rules of an ordinary CFG.

Each SCFG rule $A \rightarrow \langle \alpha, \beta \rangle$ is a combination of a production of the NL semantic
grammar, $A \rightarrow \alpha$, and a production of the MRL grammar, $A \rightarrow \beta$. We call the
string $\alpha$ an NL string, and the string $\beta$ an MR string. Non-terminals in NL and MR
strings are indexed with $\boxed{1}, \boxed{2}, \ldots$ to show their association. All derivations start with
a pair of associated start symbols, $\langle S_{\boxed{1}}, S_{\boxed{1}} \rangle$. Each step of a derivation involves the
rewriting of a pair of associated non-terminals. Below is a derivation that yields the
sample English sentence and its CLANG representation in Figure 3.1:

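$$
\begin{aligned}
& \langle\, \text{Rule}_{\boxed{1}}\ ,\ \text{Rule}_{\boxed{1}} \,\rangle \\
\Rightarrow\ & \langle\, \text{if Condition}_{\boxed{1}} \text{ , Directive}_{\boxed{2}} \text{ . ,} \\
& \quad \texttt{(}\text{Condition}_{\boxed{1}}\ \text{Directive}_{\boxed{2}}\texttt{)} \,\rangle \\
\Rightarrow\ & \langle\, \text{if Team}_{\boxed{1}} \text{ player Unum}_{\boxed{2}} \text{ has the ball , Directive}_{\boxed{3}} \text{ . ,} \\
& \quad \texttt{((bowner }\text{Team}_{\boxed{1}}\ \texttt{\{}\text{Unum}_{\boxed{2}}\texttt{\}) }\text{Directive}_{\boxed{3}}\texttt{)} \,\rangle \\
\Rightarrow\ & \langle\, \text{if our player Unum}_{\boxed{1}} \text{ has the ball , Directive}_{\boxed{2}} \text{ . ,} \\
& \quad \texttt{((bowner our \{}\text{Unum}_{\boxed{1}}\texttt{\}) }\text{Directive}_{\boxed{2}}\texttt{)} \,\rangle \\
\Rightarrow\ & \langle\, \text{if our player 4 has the ball , Directive}_{\boxed{1}} \text{ . ,} \\
& \quad \texttt{((bowner our \{4\}) }\text{Directive}_{\boxed{1}}\texttt{)} \,\rangle \\
\Rightarrow\ & \cdots \\
\Rightarrow\ & \langle\, \text{if our player 4 has the ball, then our player 6 should stay in the left side of our half. ,} \\
& \quad \texttt{((bowner our \{4\}) (do our \{6\} (pos (left (half our)))))} \,\rangle
\end{aligned}
$$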

Here the CLANG representation is said to be a translation of the English sentence. Given an NL sentence, $\mathbf{e}$, there can be multiple derivations that yield $\mathbf{e}$ (and thus multiple MRL translations of $\mathbf{e}$). To discriminate the correct translation from the incorrect ones, we use a probabilistic model, parameterized by $\lambda$, that takes a derivation, $\mathbf{d}$, and returns its likelihood of being correct. The output translation, $\mathbf{f}^*$, of a sentence, $\mathbf{e}$, is defined as:

$$\mathbf{f}^* = \mathbf{f}\left(\arg\max_{\mathbf{d} \in D(G|\mathbf{e})} \Pr_\lambda(\mathbf{d} \mid \mathbf{e})\right) \tag{3.3}$$

where $\mathbf{f}(\mathbf{d})$ is the MR string that a derivation $\mathbf{d}$ yields, and $D(G|\mathbf{e})$ is the set of all derivations of $G$ that yield $\mathbf{e}$. In other words, the output MRL translation is the yield of the most probable derivation that yields the input NL sentence. This formulation is chosen because $\mathbf{f}^*$ can be efficiently computed using a dynamic-programming algorithm (Viterbi, 1967).
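The synchronous rewriting used in the derivation above can be illustrated with a small Python sketch. It is not the Wasp implementation: the `RULES` table, the `Name[k]` encoding of indexed non-terminals, and the `expand` helper are assumptions made for illustration, and the word gap in the Condition rule is written out as the word "the" for simplicity.

```python
import re

# Each rule maps an LHS to a pair (NL template, MR template).
# Non-terminals are written Name[k]; the index k links the NL and MR sides.
RULES = {
    "Rule": ("if Condition[1] , Directive[2] .", "(Condition[1] Directive[2])"),
    "Condition": ("Team[1] player Unum[2] has the ball", "(bowner Team[1] {Unum[2]})"),
    "Team": ("our", "our"),
    "Unum": ("4", "4"),
}
NONTERM = re.compile(r"([A-Za-z]+)\[(\d+)\]")

def expand(lhs, children=None):
    """Return the (NL, MR) pair yielded by a derivation tree.

    children maps a non-terminal index to the (lhs, children) subtree that
    rewrites the associated pair; missing indices are left unexpanded.
    """
    children = children or {}
    nl, mr = RULES[lhs]
    for m in NONTERM.finditer(RULES[lhs][0]):
        name, idx = m.group(1), int(m.group(2))
        if idx in children:
            assert children[idx][0] == name
            child_nl, child_mr = expand(*children[idx])
        else:
            child_nl = child_mr = name
        # The same subtree rewrites both sides, so the two strings stay in sync.
        nl = nl.replace(m.group(0), child_nl)
        mr = mr.replace(m.group(0), child_mr)
    return nl, mr

derivation = ("Rule", {1: ("Condition", {1: ("Team", {}), 2: ("Unum", {})})})
nl, mr = expand(*derivation)
print(nl)
print(mr)
```

Because one derivation tree drives both substitutions, the MR string is obtained as a by-product of deriving the NL string, which is what lets parsing double as translation.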

Since $N$, $T_e$, $T_f$ and $S$ are fixed given an NL and an MRL, we only need to learn a lexicon, $L$, and a probabilistic model parameterized by $\lambda$. A lexicon defines the set of derivations that are possible, so the induction of a probabilistic model requires a lexicon in the first place. Therefore, the learning task can be divided into the following two sub-tasks:

1. Acquire a lexicon, $L$, which implicitly defines the set of all possible derivations, $D(G)$.

2. Learn a set of parameters, $\lambda$, that define a probability distribution over derivations in $D(G)$.

Both sub-tasks require a training set, $\{\langle \mathbf{e}_i, \mathbf{f}_i \rangle\}$, where each training example $\langle \mathbf{e}_i, \mathbf{f}_i \rangle$ is an NL sentence, $\mathbf{e}_i$, paired with its correct MR, $\mathbf{f}_i$. Lexical acquisition also requires an unambiguous CFG of the MRL. Since there is no lexicon to begin with, it is not possible to include correct derivations in the training data. Therefore, these derivations are treated as hidden variables which must be estimated through EM-type iterative training, and the learning task is not fully supervised. Figure 3.3 gives an overview of the Wasp semantic parsing algorithm.
Figure 3.3: Overview of the Wasp semantic parsing algorithm

In Sections 3.2.1-3.2.3, we will focus on lexical acquisition. We will describe the probabilistic model in Section 3.2.4.

3.2.1  Lexical  Acquisition 

A  lexicon  is  a  mapping  from  words  to  their  meanings.  In  Section  2.5.1, 
we  showed  that  word  alignments  can  be  used  for  defining  a  mapping  from  words 
to  their  meanings.  In  Wasp,  we  use  word  alignments  for  lexical  acquisition.  The 
basic  idea  is  to  train  a  statistical  word  alignment  model  on  the  training  set,  and  then 
find  the  most  probable  word  alignments  for  each  training  example.  A  lexicon  is 
formed  by  extracting  SCFG  rules  from  these  word  alignments  (Chiang,  2005). 

Let us illustrate this algorithm using an example. Suppose that we are given the string pair in Figure 3.1 as the training data. The task of the word alignment model is to find a word alignment for this string pair. A sample word alignment is shown in Figure 3.4, where each CLANG symbol is treated as a word. This presents three difficulties. First, not all MR symbols carry specific meanings. For example, in CLANG, parentheses ((, )) and braces ({, }) are delimiters that are semantically vacuous. Such symbols are not supposed to be aligned with any words, and inclusion of these symbols in the training data is likely to confuse the word alignment model. Second, not all concepts have an associated MR symbol. For example, in CLANG, the mere appearance of a condition followed by a directive indicates an if-then rule, and there is no CLANG predicate associated with the concept of an if-then rule. Third, multiple concepts may be associated with the same MR symbol. For example, the CLANG predicate pt is polysemous. Its meaning depends on the types of arguments it is given. It specifies the xy-coordinates when its arguments are two numbers (e.g. (pt 0 0)), the current position of the ball when its argument is the MR symbol ball (i.e. (pt ball)), or the current position of a player when a team and a uniform number are given as arguments (e.g. (pt our 4)). Judging from the pt symbol alone, the word alignment model would not be able to identify its exact meaning.

Figure 3.4: A word alignment between English words and CLANG symbols

A simple, principled way to avoid these difficulties is to represent an MR using a sequence of MRL productions used to generate it. This sequence corresponds to the top-down, left-most derivation of an MR. Each MRL production is then treated as a word. Figure 3.5 shows a word alignment between the sample sentence and the linearized parse of its CLANG representation. Here the second production, Condition → (bowner Team {Unum}), is the one that rewrites the Condition non-terminal in the first production, Rule → (Condition Directive), and so on. Treating MRL productions as words allows collocations to be treated as a single lexical unit (e.g. the symbols (, pt, ball, followed by )). A lexical unit can be discontiguous (e.g. (, pos, followed by a region, and then the symbol )). It also allows the meaning of a polysemous MR symbol to be disambiguated, where each possible meaning corresponds to a distinct MRL production. In addition, it allows productions that are unlexicalized (e.g. Rule → (Condition Directive)) to be associated with some English words. Note that for each MR there is a unique parse tree, since the MRL grammar is unambiguous. Also note that the structure of an MR parse tree is preserved through linearization. The structural aspect of an MR parse tree will play an important role in the subsequent extraction of SCFG rules.
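As an illustration of the linearization step, the following Python sketch lists the productions of an MR parse tree in top-down, left-most order. The nested-tuple tree encoding and the `linearize` function are assumptions made for this example, not part of Wasp; the tree is the CLANG parse of the sample sentence.

```python
def linearize(node):
    """Return the productions of a parse tree in top-down, left-most order."""
    production, children = node
    sequence = [production]
    for child in children:  # children are stored left to right
        sequence.extend(linearize(child))
    return sequence

# Parse tree of ((bowner our {4}) (do our {6} (pos (left (half our)))))
tree = ("Rule -> (Condition Directive)", [
    ("Condition -> (bowner Team {Unum})", [
        ("Team -> our", []),
        ("Unum -> 4", []),
    ]),
    ("Directive -> (do Team {Unum} Action)", [
        ("Team -> our", []),
        ("Unum -> 6", []),
        ("Action -> (pos Region)", [
            ("Region -> (left Region)", [
                ("Region -> (half Team)", [("Team -> our", [])]),
            ]),
        ]),
    ]),
])

linearized = linearize(tree)
for production in linearized:
    print(production)
```

Each element of `linearized` then plays the role of one "word" on the MR side of the word alignment.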

Word  alignments  can  be  obtained  using  any  off-the-shelf  word  alignment 
model.  In  this  work,  we  use  the  GIZA++  implementation  (Och  and  Ney,  2003)  of 
IBM  Model  5  (Brown  et  al.,  1993b). 

Assuming that each NL word is linked to at most one MRL production, SCFG rules are extracted from a word alignment in a bottom-up manner. The process starts with productions with no non-terminals on the RHS, e.g. Team → our and Unum → 4. For each of these productions, $A \rightarrow \beta$, an SCFG rule $A \rightarrow \langle \alpha, \beta \rangle$ is extracted such that $\alpha$ consists of the words to which the production is linked. For example, the following rules would be extracted from Figure 3.5:

    Team → ⟨ our , our ⟩
    Unum → ⟨ 4 , 4 ⟩
    Unum → ⟨ 6 , 6 ⟩
Figure 3.5: A word alignment between English words and CLANG productions (productions recoverable from the figure: Directive → (do Team {Unum} Action), Team → our, Unum → 6, Action → (pos Region), Region → (left Region), Region → (half Team), Team → our)

Next we consider productions with non-terminals on the RHS, i.e. predicates with arguments. In this case, the NL string $\alpha$ consists of the words to which the production is linked, as well as non-terminals showing where the arguments are realized. For example, for the bowner predicate, the extracted rule would be:

    Condition → ⟨ Team[1] player Unum[2] has (1) ball ,
                  (bowner Team[1] {Unum[2]}) ⟩

where (1) denotes a word gap of size 1, due to the unaligned word the that comes between has and ball. Formally, a word gap of size $g$ can be seen as a special non-terminal that expands to at most $g$ NL words, which allows for some flexibility during pattern matching. Note the use of indices to indicate the association between non-terminals in the extracted NL and MR strings.
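The behavior of a word gap during pattern matching can be sketched as follows. This is a minimal illustration under assumed conventions (a pattern is a list in which a string is a literal word and an integer `g` is a word gap of size `g`); the `matches` function is hypothetical and ignores ordinary non-terminals.

```python
def matches(pattern, words):
    """Check whether a pattern with word gaps matches a whole word sequence."""
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, int):
        # A gap of size g may absorb anywhere from 0 up to g words.
        limit = min(head, len(words))
        return any(matches(rest, words[k:]) for k in range(limit + 1))
    return bool(words) and words[0] == head and matches(rest, words[1:])

pattern = ["has", 1, "ball"]
print(matches(pattern, "has the ball".split()))
print(matches(pattern, "has ball".split()))
print(matches(pattern, "has the red ball".split()))
```

The gap accepts "has the ball" and "has ball" but rejects "has the red ball", since it can expand to at most one word.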


Similarly, the following SCFG rules would be extracted from the same word alignment:

    Region → ⟨ Team[1] half , (half Team[1]) ⟩
    Region → ⟨ left side of Region[1] , (left Region[1]) ⟩
    Action → ⟨ stay in (1) Region[1] , (pos Region[1]) ⟩
    Directive → ⟨ Team[1] player Unum[2] should Action[3] ,
                  (do Team[1] {Unum[2]} Action[3]) ⟩
    Rule → ⟨ if Condition[1] (1) Directive[2] (1) ,
             (Condition[1] Directive[2]) ⟩


Note  the  word  gap  (1)  at  the  end  of  the  NL  string  in  the  last  rule,  which  is  due  to 
the  unaligned  period  in  the  sentence.  This  word  gap  is  added  because  all  words  in 
a  sentence  have  to  be  consumed  by  a  derivation. 


Figure 3.6 shows the basic lexical acquisition algorithm of Wasp. The training set, $T = \{\langle \mathbf{e}_i, \mathbf{f}_i \rangle\}$, is used to train the alignment model $M$, which is in turn used to obtain the $k$-best word alignments for each training example (we use $k = 10$). SCFG rules are extracted from each of these word alignments. Extraction proceeds in a bottom-up fashion, such that an MR predicate is processed only after its arguments have all been processed. This order is enforced by the backward traversal of a linearized MR parse. The lexicon, $L$, then consists of all rules extracted from all $k$-best word alignments for all training examples.
Input: a training set, T = {⟨e_i, f_i⟩}, and an unambiguous MRL grammar, G'.

Acquire-Lexicon(T, G')
 1  L ← ∅
 2  for i ← 1 to |T|
 3      do f'_i ← linearized parse of f_i under G'
 4  Train a word alignment model, M, using {⟨e_i, f'_i⟩} as the training set
 5  for i ← 1 to |T|
 6      do a*_{1,...,k} ← k-best word alignments for ⟨e_i, f'_i⟩ under M
 7         for k' ← 1 to k
 8             do for j ← |f'_i| downto 1
 9                    do A ← lhs(f'_{ij})
10                       α ← words to which f'_{ij} and its arguments are linked in a*_{k'}
11                       β ← rhs(f'_{ij})
12                       L ← L ∪ {A → ⟨α, β⟩}
13                       Replace α with A in a*_{k'}
14  return L

Figure 3.6: The basic lexical acquisition algorithm of Wasp
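The inner loop of the procedure (lines 8-13) can be sketched in Python for a single word alignment. This is a simplified illustration, not the Wasp code: the data layout (`prods` as a pre-order list of `(lhs, rhs, parent_index)` triples, `links` as a dictionary from word positions to production indices) and both function names are assumptions. The sketch also assumes that every production is linked to at least one word or has arguments, and it omits k-best alignments and the isomorphism checks of Section 3.2.2.

```python
import re

def index_nonterminals(rhs):
    """Attach [1], [2], ... to capitalised non-terminals, left to right."""
    counter = iter(range(1, 100))
    return re.sub(r"\b[A-Z][a-z]+\b",
                  lambda m: f"{m.group(0)}[{next(counter)}]", rhs)

def acquire_rules(words, prods, links):
    """Extract one SCFG rule per production, arguments before predicates."""
    lexicon, spans = [], {}
    for j in range(len(prods) - 1, -1, -1):  # backward traversal = bottom-up
        lhs, rhs, _parent = prods[j]
        kids = [c for c, p in enumerate(prods) if p[2] == j]  # in MR order
        # Words linked directly to this production ...
        pieces = [(pos, pos, words[pos]) for pos, q in links.items() if q == j]
        # ... plus one indexed non-terminal per already-processed argument.
        for n, c in enumerate(kids, start=1):
            start, end = spans[c]
            pieces.append((start, end, f"{prods[c][0]}[{n}]"))
        pieces.sort()
        alpha, prev_end = [], None
        for start, end, token in pieces:
            if prev_end is not None and start > prev_end + 1:
                alpha.append(f"({start - prev_end - 1})")  # word gap
            alpha.append(token)
            prev_end = end
        spans[j] = (pieces[0][0], pieces[-1][1])
        lexicon.append((lhs, " ".join(alpha), index_nonterminals(rhs)))
    return lexicon

words = "our player 4 has the ball".split()
prods = [("Condition", "(bowner Team {Unum})", None),
         ("Team", "our", 0),
         ("Unum", "4", 0)]
links = {0: 1, 1: 0, 2: 2, 3: 0, 5: 0}  # "the" (position 4) is unaligned
lexicon = acquire_rules(words, prods, links)
for rule in lexicon:
    print(rule)
```

On this fragment the last extracted rule is the Condition rule with the word gap (1), matching the example given earlier in this section.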

3.2.2  Maintaining  Parse  Tree  Isomorphism 

There  are  two  cases  where  the  Acquire-Lexicon  procedure  would  not 
extract  any  rules  for  a  production  p: 

1.  None  of  the  descendants  of  p  in  the  MR  parse  tree  are  linked  to  any  words. 

2. The NL string associated with p covers a word w linked to a production p' that is not a descendant of p in the MR parse tree. Rule extraction is forbidden in this case because it would destroy the link between w and p'.

The first case arises when a concept is not realized in NL. For example, the concept of “our team” is often assumed, because advice is given from the perspective of a team coach. When we say the goalie should always stay in our goal area, we mean our (our) goalie, not the other team’s (opp) goalie. Hence the concept of our is often not realized. The second case arises when the NL and MR parse trees are not isomorphic. Consider the word alignment between our left penalty area and its CLANG representation in Figure 3.7. The extraction of the rule Region → ⟨ Team[1] (1) penalty area , (penalty-area Team[1]) ⟩ would destroy the link between left and Region → (left Region). A possible explanation for this is that, syntactically, our modifies left penalty area (consider the coordination phrase our left penalty area and right goal area, where our modifies both left penalty area and right goal area). But conceptually, “left” modifies the concept of “our penalty area” by referring to its left half. Note that the NL and MR parse trees must be isomorphic under the SCFG formalism (Section 2.4.1).

Figure 3.7: A case where the Acquire-Lexicon procedure fails (productions shown in the figure: Region → (left Region), Region → (penalty-area Team), Team → our)

The NL and MR parse trees can be made isomorphic by merging nodes in the MR parse tree, combining several productions into one. For example, since no rules can be extracted for the production Region → (penalty-area Team), it is combined with its parent node to form Region → (left (penalty-area Team)), for which an NL string Team left penalty area is extracted. In general, the merging process continues until a rule is extracted from the merged node. Assuming the alignment is not empty, the process is guaranteed to end with a rule extracted.

Figure 3.8: A case where a bad link disrupts phrasal coherence (productions shown in the figure: Region → (reg Region Region), Region → (left Region), Region → (penalty-area Team), Team → our, Region → (right Region), Region → (midfield Team), Team → our)
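The node-merging operation of Section 3.2.2 amounts to substituting a child production into its parent. A minimal Python sketch follows; the `(lhs, rhs)` string encoding and the `merge` helper are assumptions for illustration, and the sketch simply rewrites the first occurrence of the child's LHS, which is only adequate when the parent's RHS contains that non-terminal once.

```python
def merge(parent, child):
    """Combine a child production with its parent into a single production."""
    (parent_lhs, parent_rhs), (child_lhs, child_rhs) = parent, child
    assert child_lhs in parent_rhs, "child must rewrite a non-terminal of the parent"
    # Simplification: substitute the first occurrence of the child's LHS.
    return (parent_lhs, parent_rhs.replace(child_lhs, child_rhs, 1))

merged = merge(("Region", "(left Region)"), ("Region", "(penalty-area Team)"))
print(merged)
```

The merged production keeps Team as its only non-terminal, so the NL string Team left penalty area can be paired with it.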

3.2.3  Phrasal  Coherence 

The effectiveness of the lexical acquisition algorithm described so far critically depends on whether the word alignment model observes phrasal coherence. This means words that are linked to a predicate and its arguments should stay close together. Moreover, these words should form a hierarchical phrase structure that is roughly isomorphic to the MR parse tree. Any major disruption of phrasal coherence would lead to excessive node merging (Section 3.2.2), which is a major cause of overfitting. For example, in Figure 3.8, the word right is far from left penalty area, yet it is linked to the left predicate (shown as a thick line). This link crosses many other links in the word alignment, forcing many nodes in the MR parse tree to merge (e.g. left with reg, midfield with right and then with reg). The resulting SCFG rule, Region → ⟨ Team[1] left penalty area or Team[2] right midfield , (reg (left (penalty-area Team[1])) (right (midfield Team[2]))) ⟩, is very long and does not generalize well to other cases of region union (reg).

Ideally, this problem can be solved using a word alignment model that strictly observes phrasal coherence. However, this often requires rules that model the reordering of tree nodes (i.e. synchronous grammars), which are exactly what Wasp is trying to learn. Our goal is to bootstrap the learning process by using a simpler, word-based alignment model that produces a generally coherent alignment, and then remove links that could cause excessive node merging. This is done before rule extraction takes place.

The link removal algorithm works as follows. Recall that rule extraction from a word alignment, $\mathbf{a}$, is forbidden where the NL string associated with a production, $p$, covers a word linked to a production that is not a descendant of $p$ in the MR parse tree. We call such a word a violation of the isomorphism constraint. For each production $p$ in the MR parse tree, we count the number of violations that would prevent a rule from being extracted for $p$. Then the total number of violations for all productions in the MR parse tree is obtained, denoted by $v(\mathbf{a})$. A simple greedy procedure for removing bad links is to repeatedly remove the link $a \in \mathbf{a}$ that maximizes $v(\mathbf{a}) - v(\mathbf{a} \setminus \{a\}) > 0$, until $v(\mathbf{a})$ cannot be further reduced. A link stronger than a certain threshold (0.9) is never removed, so that merging of productions as in Figure 3.7 is still possible. The strength of a link $w \leftrightarrow p$ is defined as the translation probability, $\Pr(p|w)$, given by GIZA++, which is found to be highly correlated with the validity of a link. To replenish the removed links, links from a reverse alignment, $\tilde{\mathbf{a}}$ (obtained by treating the source language as target, and vice versa), are added to $\mathbf{a}$, as long as $\mathbf{a}$ remains n-to-1, and $v(\mathbf{a})$ is not increased.
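The greedy procedure can be sketched as follows. This is an illustrative reading of the description above rather than the Wasp code: the encoding (links as `(word_position, production_index)` pairs, `parents` as a pre-order parent array), the way the NL span of a production is taken to be the range of words linked to it or its descendants, and all function names are assumptions. The replenishing step with the reverse alignment is omitted. The example is modeled on Figure 3.8, with the phrase assumed to be "our left penalty area or our right midfield".

```python
def subtree(parents, p):
    """Indices of p and its descendants (productions are in pre-order)."""
    below = {p}
    for i in range(p + 1, len(parents)):
        if parents[i] in below:
            below.add(i)
    return below

def violations(links, parents):
    """v(a): words inside a production's NL span that are linked outside its subtree."""
    total = 0
    for p in range(len(parents)):
        below = subtree(parents, p)
        covered = [w for w, q in links if q in below]
        if not covered:
            continue
        lo, hi = min(covered), max(covered)
        total += sum(1 for w, q in links if lo <= w <= hi and q not in below)
    return total

def remove_bad_links(links, parents, strength, threshold=0.9):
    """Repeatedly drop the weak link whose removal reduces v(a) the most."""
    links = set(links)
    while True:
        current = violations(links, parents)
        best, best_gain = None, 0
        for link in sorted(links):
            if strength.get(link, 0.0) > threshold:
                continue  # strong links are never removed
            gain = current - violations(links - {link}, parents)
            if gain > best_gain:
                best, best_gain = link, gain
        if best is None:
            return links
        links.remove(best)

# Productions: 0 reg, 1 left, 2 penalty-area, 3 our, 4 right, 5 midfield, 6 our
parents = [None, 0, 1, 2, 0, 4, 5]
good = {(0, 3), (1, 1), (2, 2), (3, 2), (4, 0), (5, 6), (7, 5)}
bad = (6, 1)  # "right" wrongly linked to the left predicate
links = good | {bad}
strength = {link: 0.95 for link in good}
strength[bad] = 0.3
cleaned = remove_bad_links(links, parents, strength)
print(violations(links, parents), violations(cleaned, parents))
```

One violation remains after cleaning: it is the Figure 3.7 situation (left falling inside the span of our penalty area), which is protected by the strength threshold and is left to node merging.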

The complete lexical acquisition algorithm is thus the following: Train a word alignment model, $M$, and a reverse word alignment model, $\tilde{M}$, using the training set, $T$. Obtain the $k$-best alignments, $\mathbf{a}^*_{1,\ldots,k}$, and the best reverse alignment, $\tilde{\mathbf{a}}^*$, for each training example in $T$ using $M$ and $\tilde{M}$. Remove bad links from each $\mathbf{a}^*_{k'}$, and replenish the removed links by adding links from $\tilde{\mathbf{a}}^*$. Then extract rules from each $\mathbf{a}^*_{k'}$ as described in the Acquire-Lexicon procedure (lines 7-13), while merging nodes in the MR parse tree if necessary.

3.2.4  Probabilistic  Model 

Once  a  lexicon  is  acquired,  the  next  task  is  to  learn  a  probabilistic  model 
for  parse  disambiguation.  We  propose  a  log-linear  model  that  defines  a  conditional 
probability  distribution  over  derivations  given  an  input  NL  sentence.  There  has 
been  much  work  on  using  log-linear  models  for  NLP  tasks  such  as  part-of-speech 
tagging  (Ratnaparkhi,  1996),  syntactic  parsing  (Charniak,  2000;  Clark  and  Curran, 
2003),  named  entity  recognition  (Chieu  and  Ng,  2003),  and  machine  translation 
(Koehn  et  al.,  2003).  A  primary  advantage  of  log-linear  models  is  their  flexibility. 
Features  may  interact  with  each  other,  allowing  easy  experimentation  with  different 
feature  sets.  Similar  to  Riezler  et  al.  (2002),  we  will  train  our  log-linear  model  on 
incomplete  data,  since  derivations  are  not  observed  in  the  training  data.  It  is  the 
yields  of  these  derivations — NL  sentences  and  their  MRs — that  we  observe. 

In our log-linear model, the conditional probability of a derivation, $\mathbf{d}$, given an input sentence, $\mathbf{e}$, is defined as:

$$\Pr_\lambda(\mathbf{d} \mid \mathbf{e}) = \frac{1}{Z_\lambda(\mathbf{e})} \exp \sum_i \lambda_i f_i(\mathbf{d}) \tag{3.4}$$

where $f_i$ is a feature function (or feature for short) that returns a real value given a derivation, and $Z_\lambda(\mathbf{e})$ is a normalizing factor such that the conditional probabilities sum to one over all derivations that yield $\mathbf{e}$. We use the following feature types:

• For each rule $r \in L$, there is a feature, $f_r$, that returns the number of times $r$ is used in a derivation.

• For each word $w \in T_e$, there is a feature, $f_w$, that returns the number of times $w$ is generated from word gaps in a derivation.

• Generation of words not previously encountered during training is modeled using an extra feature, $f_*$, that returns the total number of words generated from word gaps in a derivation.

In  Wasp,  since  the  output  grammar  of  a  learned  SCFG  is  the  target  MRL  grammar, 
all  MRL  translations  are  well-formed  to  begin  with.  So  the  probabilistic  model  can 
be  relatively  simple.  The  number  of  features  that  we  use  in  our  log-linear  model  is 
quite  modest  (less  than  3,000  in  our  experiments).  A  similar  set  of  features  is  also 
used  by  Zettlemoyer  and  Collins  (2005). 
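To make the log-linear model concrete, the sketch below computes the conditional probabilities of a small, explicitly enumerated set of derivations. The feature names (`r1`, `r2`, `gap:the`), the weights, and the `derivation_probs` function are all hypothetical; in practice the derivations of a sentence are far too numerous to enumerate and are handled with a parsing chart.

```python
import math

def derivation_probs(feature_vectors, weights):
    """Normalized exp(sum_i lambda_i * f_i(d)) over the listed derivations."""
    scores = [sum(weights.get(name, 0.0) * value for name, value in feats.items())
              for feats in feature_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # plays the role of the normalizing factor
    return [x / z for x in exps]

# Two competing derivations of the same sentence (hypothetical feature counts).
derivations = [
    {"r1": 1, "gap:the": 1},  # uses rule r1 and generates "the" from a word gap
    {"r2": 1},                # uses rule r2 only
]
weights = {"r1": 1.0, "r2": 0.0, "gap:the": -0.5}
probs = derivation_probs(derivations, weights)
best = max(range(len(derivations)), key=lambda i: probs[i])
print(probs, best)
```

Since normalization does not change the ranking, the most probable derivation can be found from the unnormalized scores alone.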

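The output MRL translation, $\mathbf{f}^*$, given a sentence, $\mathbf{e}$, is the yield of the most probable derivation that yields $\mathbf{e}$ (cf. Equation 3.3):

$$\mathbf{f}^* = \mathbf{f}\left(\arg\max_{\mathbf{d} \in D(G|\mathbf{e})} \Pr_\lambda(\mathbf{d} \mid \mathbf{e})\right) = \mathbf{f}\left(\arg\max_{\mathbf{d} \in D(G|\mathbf{e})} \sum_i \lambda_i f_i(\mathbf{d})\right) \tag{3.5}$$

where $D(G|\mathbf{e})$ is the set of derivations under $G$ that yield $\mathbf{e}$. The output translation can be easily computed using the Viterbi algorithm (Viterbi, 1967), with an Earley chart (Earley, 1970; Stolcke, 1995) that keeps track of derivations that are consistent with the input string. Decoding takes cubic time with respect to the sentence length.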

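The model parameters, $\lambda$, are estimated by maximizing the conditional log-likelihood of the training set (Berger et al., 1996; Riezler et al., 2002):

$$\lambda^* = \arg\max_\lambda \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \log \Pr_\lambda(\mathbf{f}_j \mid \mathbf{e}_j) \tag{3.6}$$

Expanding the conditional log-likelihood, we get:

$$\begin{aligned}
L(\lambda) &= \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \log \Pr_\lambda(\mathbf{f}_j \mid \mathbf{e}_j) \\
&= \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \log \sum_{\mathbf{d} \in D(G|\mathbf{e}_j, \mathbf{f}_j)} \Pr_\lambda(\mathbf{d} \mid \mathbf{e}_j) \\
&= \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \left( \log \left( \sum_{\mathbf{d} \in D(G|\mathbf{e}_j, \mathbf{f}_j)} \exp \sum_i \lambda_i f_i(\mathbf{d}) \right) - \log Z_\lambda(\mathbf{e}_j) \right) \\
&= \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \left( \log \left( \sum_{\mathbf{d} \in D(G|\mathbf{e}_j, \mathbf{f}_j)} \exp \sum_i \lambda_i f_i(\mathbf{d}) \right) - \log \left( \sum_{\mathbf{d} \in D(G|\mathbf{e}_j)} \exp \sum_i \lambda_i f_i(\mathbf{d}) \right) \right)
\end{aligned}$$

where $D(G|\mathbf{e}, \mathbf{f})$ is the set of derivations under $G$ that yield $\langle \mathbf{e}, \mathbf{f} \rangle$ (hence $D(G|\mathbf{e}, \mathbf{f}) \subseteq D(G|\mathbf{e})$). Differentiating $L$ with respect to $\lambda_i$ gives:

$$\frac{\partial L(\lambda)}{\partial \lambda_i} = \sum_{\langle \mathbf{e}_j, \mathbf{f}_j \rangle \in T} \left( \sum_{\mathbf{d} \in D(G|\mathbf{e}_j, \mathbf{f}_j)} \Pr_\lambda(\mathbf{d} \mid \mathbf{e}_j, \mathbf{f}_j) f_i(\mathbf{d}) - \sum_{\mathbf{d} \in D(G|\mathbf{e}_j)} \Pr_\lambda(\mathbf{d} \mid \mathbf{e}_j) f_i(\mathbf{d}) \right)$$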

which is the difference between the expectations of $f_i(\mathbf{d})$ with respect to the distributions $\Pr_\lambda(\mathbf{d} \mid \mathbf{e}_j, \mathbf{f}_j)$ and $\Pr_\lambda(\mathbf{d} \mid \mathbf{e}_j)$. Locally optimal parameters $\lambda^*$ can then be found by using gradient-based methods such as gradient ascent, conjugate gradient, and quasi-Newton methods. In our experiments, we use the L-BFGS algorithm (Nocedal, 1980) to compute $\lambda^*$. L-BFGS is a limited-memory quasi-Newton method which implicitly approximates the Hessian matrix based on previous values of $L$ and $L'$. It has shown good convergence properties in various NLP-related optimization tasks (Malouf, 2002).
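The gradient formula can be checked numerically on a toy problem in which the derivations of one training example are enumerated explicitly. Everything in this sketch is an assumption made for illustration: the feature vectors, the choice of which derivations yield the correct MR, and the function names. It omits the Gaussian prior and uses brute-force enumeration in place of the chart-based computation.

```python
import math

def log_sum_exp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))

def score(lam, feats):
    return sum(l * f for l, f in zip(lam, feats))

def log_likelihood(lam, all_derivs, correct):
    """log Pr(f | e): correct lists indices of derivations that also yield f."""
    scores = [score(lam, d) for d in all_derivs]
    return log_sum_exp([scores[i] for i in correct]) - log_sum_exp(scores)

def expectation(lam, derivs):
    """Expected feature vector under the log-linear distribution over derivs."""
    scores = [score(lam, d) for d in derivs]
    log_z = log_sum_exp(scores)
    probs = [math.exp(s - log_z) for s in scores]
    return [sum(p * d[i] for p, d in zip(probs, derivs)) for i in range(len(lam))]

def gradient(lam, all_derivs, correct):
    """Expectation over correct derivations minus expectation over all derivations."""
    e_correct = expectation(lam, [all_derivs[i] for i in correct])
    e_all = expectation(lam, all_derivs)
    return [a - b for a, b in zip(e_correct, e_all)]

all_derivs = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # feature vectors of D(G|e)
correct = [0, 2]                                # those that also yield f
lam = [0.2, -0.1, 0.3]
grad = gradient(lam, all_derivs, correct)

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric = []
for i in range(len(lam)):
    up = lam[:i] + [lam[i] + eps] + lam[i + 1:]
    down = lam[:i] + [lam[i] - eps] + lam[i + 1:]
    numeric.append((log_likelihood(up, all_derivs, correct)
                    - log_likelihood(down, all_derivs, correct)) / (2 * eps))
print(grad, numeric)
```

The analytic and finite-difference gradients agree closely, which is a convenient sanity check before handing the objective to an optimizer such as L-BFGS.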


Computation of $L$ and $L'$ requires statistics that depend on $D(G|\mathbf{e}_j, \mathbf{f}_j)$ and $D(G|\mathbf{e}_j)$. Since both sets can be extremely large, it is not feasible to enumerate them. However, using a parsing chart similar to the one used for decoding, it is possible to obtain the required statistics using dynamic-programming techniques similar to the Inside-Outside algorithm (Miyao and Tsujii, 2002). In particular, computation that involves $D(G|\mathbf{e}_j, \mathbf{f}_j)$ can be done by keeping track of MR translations inside chart items, and allowing chart items to combine only when it results in a substring of $\mathbf{f}_j$.

A Gaussian prior ($\sigma^2 = 100$) is used to regularize the log-linear model (Chen and Rosenfeld, 1999). Unlike the fully-supervised case, the conditional log-likelihood $L$ is not concave with respect to $\lambda$, so the optimization algorithm is sensitive to initial parameters. To assume as little as possible, $\lambda$ is initialized to 0. Following Zettlemoyer and Collins (2005), only rules that are used in the most probable derivations for each training example are retained in the final lexicon. All other rules are discarded. This heuristic is used to improve accuracy of the semantic parser, assuming that rules used in the most probable derivations are the most accurate.

In  summary,  the  Wasp  learning  algorithm  is  divided  into  two  sub-tasks.  The 
first  sub-task  is  to  acquire  a  lexicon  consisting  of  SCFG  rules  extracted  from  word 
alignments  between  training  sentences  and  their  correct  MRs.  The  second  sub-task 
is  to  estimate  the  parameters  that  define  a  log-linear  distribution  over  parses  under 
the  learned  SCFG.  The  resulting  weighted  SCFG  can  then  be  used  for  parsing  novel 
sentences. 

3.3  Experiments 

This  section  describes  the  experiments  that  were  performed  to  demonstrate 
the  effectiveness  of  the  Wasp  semantic  parsing  algorithm. 
3.3.1 Data Sets

We evaluated Wasp in the Geoquery and RoboCup domains (Section 2.1). The Geoquery corpus consists of 880 English questions gathered from various sources. 250 of them were gathered from an undergraduate language class (Zelle and Mooney, 1996). These questions were manually translated into a logical query language based on Prolog (Appendix A.1). An additional 630 English questions were subsequently gathered from an undergraduate AI class, and from users of a web interface to a Chill (Zelle and Mooney, 1996) prototype trained on the initial 250 data set (Tang and Mooney, 2001). These questions, together with their Prolog logical forms and the original 250 data set, form a larger 880-example data set. Queries in the 250 data set were also translated into Spanish and Turkish, each by a native speaker of the language, and into Japanese by an English native speaker who had learned Japanese as a second language.

Since  Wasp  can  only  handle  variable-free  MRLs,  we  wrote  a  small  program 
that  translates  Prolog  logical  forms  into  a  functional  query  language  (FunQL)  de¬ 
veloped  for  the  Geoquery  domain  (Appendix  A.2). 

For  the  RoboCup  domain,  the  corpus  consists  of  300  pieces  of  coaching 
advice  encoded  in  CLANG  (Appendix  A.3),  randomly  selected  from  the  log  files 
of  the  2003  RoboCup  Coach  Competition.  Each  formal  statement  was  manually 
translated  into  English  by  one  of  four  annotators  (Kate  et  al.,  2005).  Basically, 
CLANG  statements  are  variable-free.  The  CLANG  language  does  allow  the  use  of 
logical  variables,  but  they  have  very  limited  use  and  rarely  occur  in  the  data. 

Table 3.1 shows some statistics of the corpora used for evaluating Wasp. Note that sentences in the RoboCup data set are much longer than those in Geoquery on average.

                     Geoquery                                             RoboCup
MRL                  FunQL                                                CLang
# non-terminals      13                                                   12
# productions        133                                                  134
# examples           250                                        880       300
NL                   English   Spanish   Japanese   Turkish     English   English
Avg. sent. length    6.87      7.39      9.11       5.76        7.57      22.52
# unique words       165       159       158        220         280       337

Table 3.1: Corpora used for evaluating Wasp


3.3.2  Methodology 


We  performed  standard  10-fold  cross  validation  in  our  experiments.  During 
testing,  we  counted  the  number  of  sentences  for  which  there  was  an  output  MRL 
translation.  Translation  fails  when  there  are  constructs  in  a  sentence  that  a  learned 
parser  does  not  cover.  We  also  counted  the  number  of  output  MRL  translations  that 
were  correct.  For  Geoquery,  a  translation  is  correct  if  it  retrieves  the  same  answer 
from  the  Geoquery  database  as  the  reference  query.  For  RoboCup,  a  translation 
is  correct  if  it  exactly  matches  the  correct  MR,  up  to  reordering  of  arguments  for 
commutative  predicates  like  and.  These  strict  criteria  were  chosen  because  two 
slightly  different  representations  can  have  very  different  meanings  (e.g.  negation). 
Based  on  these  counts,  we  compute  the  precision,  recall  and  F-measure  of  a  learned 
parser: 


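$$\text{Precision} = \frac{\text{No. of correct output translations}}{\text{No. of output translations}} \tag{3.7}$$

$$\text{Recall} = \frac{\text{No. of correct output translations}}{\text{No. of test sentences}} \tag{3.8}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.9}$$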
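The three measures can be computed directly from the counts collected during testing. The following Python sketch uses a hypothetical `evaluate` function and made-up counts; it assumes that at least one translation was output and at least one was correct, so that no denominator is zero.

```python
def evaluate(num_correct, num_output, num_test):
    """Precision, recall and F-measure from the three test-time counts."""
    precision = num_correct / num_output
    recall = num_correct / num_test
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical fold: 100 test sentences, 80 translated, 70 translated correctly.
precision, recall, f_measure = evaluate(num_correct=70, num_output=80, num_test=100)
print(precision, recall, f_measure)
```

Because untranslated sentences count against recall but not precision, a parser that abstains on hard sentences trades recall for precision.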
For each domain, there is a minimal set of initial rules representing knowledge needed for translating basic domain entities. These rules are always included in a lexicon. For Geoquery, the initial rules are the following:

    CityName → ⟨ e(c) , c ⟩, for all city names c
        e.g. for English: ⟨ new york , 'new york' ⟩
             for Japanese: ⟨ nyuu yooku , 'new york' ⟩
    RiverName → ⟨ e(r) , r ⟩, for all river names r
    StateName → ⟨ e(s) , s ⟩, for all state names s
    Similar rules for lake names, mountain names, state name abbreviations, and other place names.

Here e(x) is an NL expression that corresponds to x. Since the Geoquery database is in English, e(x) = x for English. For other languages, e(x) can be different. For example, e('new york') is nyuu yooku in Japanese. A rule such as CityName → ⟨ nyuu yooku , 'new york' ⟩ provides domain knowledge that cannot be easily learned without analyzing the phonological features of a name. Such initial rules can be easily constructed from a bilingual dictionary. Note that a name can be ambiguous. For example, New York can be either a state or a city. A semantic parser needs to disambiguate between these two cases based on surrounding context.

For RoboCup, the initial rules are the following:

    Unum → ⟨ i , i ⟩, for all integers i = 1, ..., 11
    Num → ⟨ x , x ⟩, for all real numbers x
    Ident → ⟨ s , "s" ⟩, for all possible CLANG identifiers s

The purpose of these initial rules is to provide a default translation for all unseen numbers and identifiers.
            Geoquery (880 data set)            RoboCup
            Prec. (%)   Rec. (%)   F (%)       Prec. (%)   Rec. (%)   F (%)
Wasp        87.2        74.8       80.5        88.9        61.9       73.0
Cocktail    89.9        79.4       84.3        -           -          -
Silt        89.0        54.1       67.3        83.9        50.7       63.2
Krisp       93.3        71.7       81.1        85.2        61.9       71.7
Scissor     95.5        77.2       85.4        90.0        80.7       85.1
ZC07        95.5        83.2       88.9        -           -          -

Table 3.2: Performance of semantic parsers on the English corpora


3.3.3  Results  and  Discussion 

Table  3.2  shows  the  performance  of  Wasp  on  the  English  corpora  with  full 
training  data,  compared  to  five  other  algorithms:2 

•  Cocktail  (Tang  and  Mooney,  2001),  a  shift-reduce  parser  based  on  induc¬ 
tive  logic  proramming. 

•  Silt  (Kate  et  al.,  2005),  a  deterministic  parser  using  tree-to-string  transfor¬ 
mation  rules. 

•  Krisp  (Kate  and  Mooney,  2006),  an  SVM-based  parser  using  string  kernels. 

•  Scissor  (Ge  and  Mooney,  2006),  a  combined  syntactic-semantic  parser  with 
discriminative  reranking. 

•  Zettlemoyer  and  Collins  (2007)  (abbreviated  as  ZC07),  a  probabilistic  parser 
based  on  relaxed  CCG  grammars. 

2 The results reported in Zettlemoyer and Collins (2007) for Geoquery are based on a single
split of data with 600 training examples. Our experiments using their split gave similar results.

The best-performing systems for each domain are shown in bold in Table 3.2.³
Figures 3.9 and 3.10 show the precision and recall learning curves.
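The F (%) columns of Table 3.2 can be checked against the precision and recall columns. The following is a minimal sketch, assuming that F is the usual harmonic mean of precision and recall; the function name `f_measure` and the dictionary layout are illustrative only and are not part of the original experiments.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 3.2.
table_3_2 = {
    ("Wasp", "Geoquery"): (87.2, 74.8),
    ("Wasp", "RoboCup"): (88.9, 61.9),
    ("Krisp", "Geoquery"): (93.3, 71.7),
    ("Scissor", "Geoquery"): (95.5, 77.2),
}

for (system, domain), (p, r) in table_3_2.items():
    print(f"{system:8s} {domain:9s} F = {f_measure(p, r):.1f}")
```

Because the reported figures are averages over trials, a value recomputed from averaged precision and recall need not match the reported F exactly in every row.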

Regarding  these  results,  several  points  should  be  noted: 

• Due to memory overflow, Cocktail cannot handle more than 160 training examples in the
RoboCup domain.

• No results have been reported for ZC07 in the RoboCup domain. In fact, it is unclear how
ZC07 can deal with discontiguous lexical items, which frequently appear in this domain (see
Section 5.4.2 for further discussion of discontiguous lexical items).

• Both Cocktail and ZC07 use Prolog logical forms as the target MRL for Geoquery. In
Section 4.3.2, we show that Prolog logical forms can be a better MRL for this domain.

• A hand-built lexicon was supplied to Cocktail in the Geoquery domain. For RoboCup,
lexicons automatically acquired by Wolfie (Thompson and Mooney, 1999) were used instead.

• Scissor requires semantically-augmented parse trees for training (Section 2.2.1).

• ZC07 requires the following hand-written components: (1) language-specific template
rules (Section 2.2.1), and (2) lexical items for certain function words such as wh-words
and determiners.

3 No statistical test was performed for two reasons. First, the experimental set-up in ZC07 was
different. Also, for Scissor, neither per-trial statistics nor actual system output was available.

Figure 3.9: Learning curves for semantic parsers on the Geoquery 880 data set
((a) Precision, (b) Recall)

Figure 3.10: Learning curves for semantic parsers on the RoboCup data set

Therefore, compared to Wasp, Silt and Krisp, more prior knowledge was incorporated into
Cocktail (in the Geoquery domain), Scissor and ZC07. The experimental results show a clear
advantage of such extra supervision, especially in the RoboCup domain where sentences are
long and data is scarce.

Cocktail  has  very  low  precision  and  recall  in  the  RoboCup  domain.  The 
difficulty  apparently  lies  in  the  length  of  sentences  being  processed.  Cocktail’s 
deterministic  shift-reduce  framework  processes  a  sentence  only  from  the  beginning 
to  the  end.  If  it  fails  to  parse  the  beginning  of  a  sentence,  then  it  will  fail  to  parse 
the  rest  of  the  sentence.  In  contrast,  Wasp’s  chart  parsing  algorithm  takes  a  holistic 
view  of  a  sentence  and  is  very  efficient. 

Wasp  also  outperforms  Silt  in  terms  of  recall.  In  Silt,  transformation 
rules  are  learned  for  each  MRL  production  individually,  and  the  learned  rules  do 
not  necessarily  cooperate  to  give  a  complete  parse  of  a  training  sentence.  In  Wasp, 
an  extracted  SCFG  always  covers  the  entire  training  set. 

Wasp’s  performance  is  competitive  compared  to  Krisp.  Moreover,  as 
shown  in  the  learning  curves,  a  Wasp  parser  is  consistently  more  precise  than  a 
Krisp  parser  when  trained  on  small  data  sets. 

Overall, our experiments show that Wasp performs competitively compared to other methods
requiring similar supervision, and is considerably more robust than methods based on
deterministic parsing (e.g. Cocktail).

A major advantage of Wasp over methods such as Scissor and ZC07 is that it does not
require any prior knowledge of the NL syntax for training. Therefore, porting Wasp to
another NL is relatively easy. We illustrate this by evaluating Wasp’s performance on the
multilingual Geoquery 250 data set. The languages being considered are English, Spanish,
Japanese and Turkish. These languages differ in terms of word order: Subject-Verb-Object
(SVO) for English and Spanish, and Subject-Object-Verb (SOV) for Japanese and Turkish.
They also differ in terms of morphology: English and Spanish are inflected languages,
while Japanese and Turkish are agglutinative languages, where words are formed by joining
many morphemes together. Each combination of morphemes creates a different word, so
agglutinative languages tend to have a larger vocabulary. As shown in Table 3.3 and
Figure 3.11, Wasp’s performance is consistent across all four languages, although recall
is lower for Turkish. The reason is that the Turkish corpus has a larger vocabulary
(Table 3.1), and the extracted rules tend to be less general. A possible solution is to
split words into morphemes and treat each morpheme as a separate token. This has been
done by hand for the Japanese corpus.

                        Wasp
            Prec. (%)   Rec. (%)   F (%)
English       95.42       70.00     80.76
Spanish       91.99       72.40     81.03
Japanese      91.98       74.40     82.86
Turkish       96.96       62.40     75.93

Table 3.3: Performance of Wasp on the multilingual Geoquery data set

Wasp has much room for improvement compared to methods like Scissor and ZC07. The
performance gap can be narrowed by using word alignments derived from the augmented parse
trees used for training Scissor, in place of the automatic word alignments given by
GIZA++ (Table 3.4). This form of extra supervision improves the precision of Wasp in both
domains and its recall on RoboCup, although recall on Geoquery is essentially unchanged.
However, we also found that the choice of MRL plays an important role in semantic parsing.
In particular, for the Geoquery domain, Prolog logical forms can be a more appropriate
MRL than FunQL. In the next chapter, we will explore ways to extend Wasp to handle Prolog
logical forms. The resulting algorithm, λ-Wasp, uses the same amount of supervision as
Wasp, and is shown to perform comparably to ZC07, and better than Scissor.

Figure 3.11: Learning curves for Wasp on the multilingual Geoquery data set

                        Geoquery (880 data set)                  RoboCup
                   Prec. (%)   Rec. (%)   F (%)       Prec. (%)   Rec. (%)   F (%)
Wasp                 87.2        74.8      80.5         88.9        61.9      73.0
 + hand-written
   word alignments   93.6        74.1      82.7         94.6        65.0      76.8
Scissor              95.5        77.2      85.4         90.0        80.7      85.1
ZC07                 95.5        83.2      88.9          -           -         -

Table 3.4: Performance of Wasp with extra supervision

3.4  Related  Work 

The lexical acquisition algorithm of Wasp can be seen as projecting syntactic structures
from the target MRL to the source NL via an automatically word-aligned parallel corpus.
Besides semantic parsing and syntax-based MT, the idea of structural projection has also
been used for inducing part-of-speech taggers and noun phrase bracketers (Yarowsky and
Ngai, 2001), and automating the annotation of less-studied languages (Xia and Lewis, 2007;
Moon and Baldridge, 2007).

The problem of phrasal coherence (Section 3.2.3) has recently caught the attention of the
statistical MT community (e.g. Fox, 2002). Several syntax-aware word alignment models have
been proposed: Cherry and Lin (2006) propose a discriminative word alignment model using
features derived from a synchronous grammar. Training this model, however, requires a
small set of hand-written word alignments. DeNero and Klein (2007) present a variant of
the HMM alignment model (Vogel et al., 1996) with a syntax-sensitive distortion
probability distribution. Unlike Cherry and Lin (2006), their method does not require
hand-written word alignments for training. In May and Knight (2007), word alignments are
made coherent by re-aligning the training set with a learned syntax-based translation
model. Both DeNero and Klein’s (2007) and May and Knight’s (2007) methods can be used in
place of the algorithm described in Section 3.2.3 for better performance of Wasp.

3.5  Chapter  Summary 

In this chapter, we formulated the semantic parsing problem as a language translation
task, where NL sentences are translated into formal MRs through synchronous parsing. We
described a learning algorithm for semantic parsing called Wasp. The input to the learning
algorithm is a set of training sentences coupled with their correct MRs, and an
unambiguous CFG of the target MRL, which is assumed to be variable-free. The output from
the learning algorithm is an SCFG, together with parameters that define a log-linear
distribution over parses under this grammar. Lexical acquisition is performed using
off-the-shelf word alignment models. Since Wasp does not require any prior knowledge of
the NL syntax for training, porting Wasp to other NLs is relatively easy. Experiments
showed that Wasp’s performance is consistent across different languages and domains, and
is competitive compared to the currently best methods requiring similar supervision.

Chapter 4


Semantic Parsing with Logical Forms


Formal semantic analysis of natural languages typically uses predicate logic as the
representation language. In this chapter, we extend the Wasp semantic parsing algorithm
to handle logical forms (Wong and Mooney, 2007b). The resulting algorithm, λ-Wasp, is
based on an extended version of SCFG in which logical forms are generated using the
lambda calculus. It is shown to be one of the best-performing systems so far in the
Geoquery domain.

4.1  Motivation 

Traditionally, linguists have used predicate logic to represent meanings associated with
NL expressions (Montague, 1970; Dowty et al., 1981). There are many different kinds of
predicate logic that deal with different linguistic phenomena such as quantification,
modality, underspecification, and discourse. A common feature of these logical languages
is the use of logical variables to denote entities. For example, in Figure 4.1, the
logical variables x1 and x2 are used to denote a state and the area of a state,
respectively.

answer(x1, smallest(x2, (state(x1), area(x1, x2))))

What is the smallest state by area ?

Figure 4.1: A Prolog logical form in Geoquery and its English gloss

In the last chapter, we showed that semantic parsing can be cast as a machine translation
task, where an SCFG is used to model the translation of an NL into a formal MRL. But the
use of SCFG for semantic parsing is limited to variable-free MRLs, because SCFG does not
have a principled mechanism for handling logical variables. This is unfortunate because
most existing work on computational semantics is based on predicate logic (Charniak and
Wilks, 1976; Blackburn and Bos, 2005). For some domains, this problem can be avoided by
transforming a logical language into a variable-free, functional language such as FunQL
in Geoquery. However, development of such a functional language is non-trivial, and as we
will see, logical forms can improve generalization for semantic analysis.

On the other hand, most existing methods for mapping NL expressions to logical forms
involve substantial hand-written components that are difficult to maintain. For example,
Crouch (2005) describes a semantic interpreter based on a broad-coverage, hand-written
lexical functional grammar (Riezler et al., 2002). The English Resource Grammar (Copestake
and Flickinger, 2000) is another human-engineered, semantically-grounded grammar which
has been used in transfer-based spoken language translation. Other systems contain a
hand-written rule-based component that transforms syntactic derivations into semantic
representations (Bayer et al., 2004; Bos, 2005). Compared to these systems, the CCG-based
parsers by Zettlemoyer and Collins (2005, 2007) are much easier to maintain, but they
still rely on a small set of hand-written template rules for generating lexical entries,
which can create a knowledge-acquisition bottleneck.

In this chapter, we show that the synchronous parsing framework can be used to translate
NL sentences into logical forms. We extend the SCFG formalism by adding variable-binding
λ-operators to the MR strings. Complete logical forms are then generated with the lambda
calculus (Church, 1940), which is commonly used to provide a compositional semantics for
NLs (Montague, 1970; Steedman, 2000). We call the extended grammar formalism a λ-SCFG. We
propose a learning algorithm similar to Wasp, which learns a λ-SCFG from a set of training
sentences paired with their correct logical forms, together with parameters that define a
log-linear distribution over parses under the λ-SCFG. We call the extended algorithm
λ-Wasp. Experiments show that λ-Wasp is currently one of the best-performing semantic
parsing algorithms in the Geoquery domain.

4.2 The λ-Wasp Algorithm

This section describes the λ-Wasp algorithm. We first define the λ-SCFG formalism
(Section 4.2.1). Then we introduce the basic learning algorithm of λ-Wasp (Sections 4.2.2
and 4.2.3). While reasonably effective, it can be further improved through transformation
of logical forms (Section 4.2.4) and language modeling (Section 4.2.5).

4.2.1 The λ-SCFG Formalism

To see why it is problematic to use an SCFG to generate logical forms, consider the formal
query in Figure 4.1. The answer to this formal query, which is the state with the smallest
area, is denoted by x1. Accordingly, x1 occurs three times in this logical form: the first
time under the answer predicate, the second time under state, and the third time under
area. An SCFG would generate these three instances of x1 in three separate steps
(Figure 4.2). However, it is very difficult to model their dependencies because of the
context-free assumption of an SCFG. What this grammar lacks is a principled mechanism for
naming logical variables.

(a) English

Query
  What is Form
    the smallest Form Form
      state
      by area

(b) Prolog

Query
  answer(x1, Form)
    smallest(x2, (Form, Form))
      state(x1)
      area(x1, x2)

Figure 4.2: An SCFG parse for the string pair in Figure 4.1

To make it possible to model the dependencies between logical variables, we introduce
variable-binding λ-operators to the SCFG formalism. We call the resulting grammar a
λ-SCFG.

Recall that in an SCFG, each rule has the following form:

A → ⟨α, β⟩    (4.1)

where A is a non-terminal, α is an NL string, and β is an MRL translation of α. Both α
and β are strings of terminal and non-terminal symbols. In a λ-SCFG, each rule has the
following form:

A → ⟨α, λx_1 ... λx_k.β⟩    (4.2)

where α is an NL string and β is an MRL translation of α. Unlike (4.1), β is a string of
terminals, non-terminals, and logical variables. The variable-binding operators λ bind
occurrences of the logical variables x_1, ..., x_k in β, and make λx_1 ... λx_k.β a
λ-function of arity k. When applied to a list of arguments, (x_{i_1}, ..., x_{i_k}), the
λ-function gives βσ, where σ is a substitution, {x_1/x_{i_1}, ..., x_k/x_{i_k}}, that
replaces all bound occurrences of x_j in β with x_{i_j}. For example, in the following
expression:

(λx1.λx2.area(x1, x2))(x2, x3)

λx1.λx2.area(x1, x2) is a λ-function of arity 2 in which occurrences of x1 and x2 are
bound. When applied to (x2, x3), this λ-function gives area(x2, x3).

To avoid accidental binding of variables, if any of the arguments appear in β as a free
variable (i.e. not bound by any λ-operators), then those free variables in β must be
renamed before function application takes place. For example, in the following
λ-function:

λx1.(state(x1), next_to(x1, x2), equal(x2, stateid(texas)))

x2 is a free variable. It must be renamed to something other than x2 so that this
λ-function can be applied to (x2).
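The two examples above (simultaneous substitution and renaming of clashing free variables) can be sketched in a few lines. This is an illustrative sketch only, not the thesis implementation: MR strings are modelled as nested tuples `(functor, *args)`, the conjunction operator is written as the functor `","`, variables are assumed to be the leaves named `x1`, `x2`, ..., and the helper names `variables`, `substitute` and `apply_lambda` are hypothetical. The choice of the fresh name is arbitrary.

```python
import itertools
import re


def variables(term):
    """Logical variables in a term; a term is a str leaf or a (functor, *args) tuple."""
    if isinstance(term, str):
        # Assumption: variables are named x1, x2, ...; all other leaves are constants.
        return {term} if re.fullmatch(r"x\d+", term) else set()
    return set().union(*(variables(arg) for arg in term[1:]))


def substitute(term, sigma):
    """Apply the substitution sigma to all leaves of a term in a single pass."""
    if isinstance(term, str):
        return sigma.get(term, term)
    return (term[0],) + tuple(substitute(arg, sigma) for arg in term[1:])


def apply_lambda(params, body, args):
    """Apply (lambda params . body) to args, renaming clashing free variables first."""
    if len(params) != len(args):
        raise ValueError("arity mismatch")
    free = variables(body) - set(params)
    clashes = free & set(args)
    used = variables(body) | set(params) | set(args)
    fresh = (f"x{i}" for i in itertools.count(1) if f"x{i}" not in used)
    sigma = {v: next(fresh) for v in sorted(clashes)}  # rename free variables
    sigma.update(zip(params, args))                    # bind the parameters
    return substitute(body, sigma)


# (λx1.λx2.area(x1, x2))(x2, x3)
area = apply_lambda(["x1", "x2"], ("area", "x1", "x2"), ["x2", "x3"])

# λx1.(state(x1), next_to(x1, x2), equal(x2, stateid(texas))) applied to (x2)
body = (",", ("state", "x1"), ("next_to", "x1", "x2"),
        ("equal", "x2", ("stateid", "texas")))
renamed = apply_lambda(["x1"], body, ["x2"])
print(area)
print(renamed)
```

In the second call the free variable x2 is renamed (here to x3) before x1 is replaced by the argument x2, so the argument is not accidentally identified with the free variable.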

Each non-terminal A_j in β is followed by a list of k_j arguments (k_j can be 0). During
parsing, A_j must be rewritten by a λ-function of arity k_j. As with an SCFG, a derivation
starts with a pair of associated start symbols, and it ends when all non-terminals have
been rewritten. For example, Figure 4.3 shows a possible λ-SCFG parse of the string pair
in Figure 4.1. The yield of the MR parse tree (Figure 4.3(b)) is the following expression:

answer(x1,
  (λx1.smallest(x2, (
    (λx1.state(x1))(x1),
    (λx1.λx2.area(x1, x2))(x1, x2)
  )))(x1)
)

(a) English

Query
  What is Form
    the smallest Form Form
      state
      by area

(b) Prolog

Query
  answer(x1, Form(x1))
    λx1.smallest(x2, (Form(x1), Form(x1, x2)))
      λx1.state(x1)
      λx1.λx2.area(x1, x2)

Figure 4.3: A λ-SCFG parse for the string pair in Figure 4.1

Each λ-function in this expression is then applied to its corresponding arguments in a
bottom-up manner, resulting in an MR string free of λ-operators with logical variables
properly named. For example, given the above expression, we first apply λx1.state(x1) to
(x1), and λx1.λx2.area(x1, x2) to (x1, x2). This results in two MR strings: state(x1) and
area(x1, x2). These two strings combine with λx1.smallest(...) to form a larger
λ-function, which is then applied to (x1). The resulting string, smallest(...), combines
with answer(x1, ...), giving the logical form in Figure 4.1.
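The bottom-up evaluation just described can be sketched as a small recursive procedure over the MR parse tree of Figure 4.3(b). This is a hedged illustration rather than the thesis implementation: the classes `Node` and `Child`, the functions `evaluate` and `show`, and the tuple encoding of MR strings (with `","` as the conjunction functor) are all assumptions made for the example. For brevity the sketch also assumes that no free variable clashes with an argument, so no renaming is performed.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    index: int    # which child sub-derivation rewrites this non-terminal
    args: tuple   # argument list that follows the non-terminal


@dataclass
class Node:
    params: tuple                # variables bound by the λ-operators
    body: object                 # term: str leaf, (functor, *args) tuple, or Child
    children: list = field(default_factory=list)


def substitute(term, sigma):
    """Replace variables in a fully expanded term according to sigma."""
    if isinstance(term, str):
        return sigma.get(term, term)
    return (term[0],) + tuple(substitute(a, sigma) for a in term[1:])


def evaluate(node):
    """Return (params, body) with every non-terminal rewritten, bottom-up."""
    def expand(term):
        if isinstance(term, Child):
            params, body = evaluate(node.children[term.index])
            assert len(params) == len(term.args), "arity mismatch"
            return substitute(body, dict(zip(params, term.args)))
        if isinstance(term, str):
            return term
        return (term[0],) + tuple(expand(a) for a in term[1:])
    return node.params, expand(node.body)


def show(term):
    """Render a term in Prolog-like notation."""
    if isinstance(term, str):
        return term
    functor, *args = term
    inner = ",".join(show(a) for a in args)
    return f"({inner})" if functor == "," else f"{functor}({inner})"


state = Node(("x1",), ("state", "x1"))
area = Node(("x1", "x2"), ("area", "x1", "x2"))
smallest = Node(("x1",),
                ("smallest", "x2", (",", Child(0, ("x1",)), Child(1, ("x1", "x2")))),
                [state, area])
query = Node((), ("answer", "x1", Child(0, ("x1",))), [smallest])

_, logical_form = evaluate(query)
print(show(logical_form))
```

Each `Child` is replaced by the result of applying the child's λ-function to the argument list recorded in the parent, which mirrors the order of operations in the paragraph above.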

The following rules can be used to produce the parse trees in Figure 4.3:

Query → ⟨ what is Form[1] , answer(x1, Form[1](x1)) ⟩
Form → ⟨ smallest Form[1] Form[2] ,
         λx1.smallest(x2, (Form[1](x1), Form[2](x1, x2))) ⟩
Form → ⟨ state , λx1.state(x1) ⟩
Form → ⟨ by area , λx1.λx2.area(x1, x2) ⟩

Note that non-terminals in the NL and MR strings in each rule are indexed with
[1], [2], ... to show their association. With the λ-operators, now we have a principled
mechanism for naming logical variables across a derivation.

As a side note, the first two rules listed above can be reformulated as follows:

Query → ⟨ what is Form[1] , λp1.answer(x1, p1(x1)) ⟩
Form → ⟨ smallest Form[1] Form[2] ,
         λp1.λp2.λx1.smallest(x2, (p1(x1), p2(x1, x2))) ⟩

In other words, non-terminals in the MR strings can be seen as bound occurrences of
logical variables, p_i, that abstract over λ-functions (Dowty et al., 1981, pp. 102-103).
The names of these logical variables correspond to the non-terminal indices in the NL
strings. This notation of higher-order abstraction has been widely used in the linguistics
literature (e.g. Steedman, 2000). However, in this thesis, we will follow our synchronous
grammar formulation of λ-SCFG. The reason is two-fold:

1. The synchronous grammar formulation makes explicit the similarity between SCFGs and
λ-SCFGs. For example, a λ-SCFG degenerates into an SCFG if none of its rules contain any
occurrences of logical variables. As we will see, the lexical acquisition algorithms for
SCFGs and λ-SCFGs are also very similar.

2. Compared to logical variables, non-terminals can provide additional, domain-specific
type constraints.

We also note that a λ-SCFG can be seen as a generalized context-free grammar (GCFG)
(Pollard, 1984), a formalism used by Weir (1988) to characterize mildly context-sensitive
grammars. A GCFG is context-free in the sense that rewriting choices in a derivation are
independent of the derivation history. It can be shown that this is the case for a λ-SCFG.

The λ-SCFG formalism is also close to Lingol (Pratt, 1973), a linguistically-oriented
programming language designed for NLP tasks such as machine translation and semantic
parsing. A Lingol program can be seen as a synchronous grammar, in which each grammar rule
consists of an NL phrase coupled with an arbitrary Lisp S-expression (i.e. a small
program). These S-expressions combine to produce an analysis of a complete sentence. In
the λ-SCFG formalism, such expressions are restricted to be λ-functions which combine
solely through function application.

answer(x1, count(x2, (state(x2), next_to(x2, x3),
    most(x3, x4, (state(x3), next_to(x3, x4), state(x4)))), x1))

How many states border the state that borders the most states ?

Figure 4.4: A Prolog logical form in Geoquery and its English gloss

4.2.2 Lexical Acquisition

Given a set of training sentences paired with their correct logical forms, {⟨e_i, f_i⟩},
the first learning task of λ-Wasp is to find a λ-SCFG, G, that covers the training data.
Like Wasp, we construct G using rules extracted from word alignments. We illustrate this
using Figures 4.4-4.7. The parse tree in Figure 4.5 is obtained using an unambiguous CFG
for Prolog logical forms.1 In this grammar, each production corresponds to a formula.
Also, a conjunction operator (,) always combines with its left conjunct to avoid ambiguity
in the Prolog grammar. Figure 4.6 shows a sample word alignment from which λ-SCFG rules
can be extracted using the algorithm described in Sections 3.2.1-3.2.3.

However, this results in a λ-SCFG where logical variables are never bound. Basically,
λ-SCFG rules should be extracted from a word alignment based on an MR parse tree where
logical variables are explicitly bound by λ-operators (Figures 4.7 and 4.8).

1 Although we focus on Prolog logical forms, techniques developed in this chapter should be
applicable to many other logical languages.

Query
  answer(x1, Form)
    count(x2, (Form), x1)
      state(x2), Form
        next_to(x2, x3), Form
          most(x3, x4, (Form))
            state(x3), Form
              next_to(x3, x4), Form
                state(x4)

Figure 4.5: A parse tree for the logical form in Figure 4.4

The transformation from Figure 4.5 to Figure 4.7 is straightforward. It can be done in a
bottom-up manner, starting with MRL productions with no non-terminals on the RHS, e.g.
Form → state(x4). For each production A → β, a logical variable x_i is bound whenever x_i
appears in β as well as outside the MR sub-parse rooted at A. Such logical variables need
to be bound because otherwise, they would be renamed during function application, and
therefore, become invisible to the rest of the logical form. Other logical variables need
not be bound, e.g. those that only appear in β but not outside. As we add λx_i to β, we
also add x_i to the argument list that follows A in the parent MRL production. For
example, in Figure 4.5, the logical variable x4 in Form → state(x4) needs to be bound
because x4 appears under the most and next_to predicates as well. It would also be added
to the argument list that follows Form in the parent MRL production, resulting in
Form → next_to(x3, x4), Form(x4). This procedure continues upward until the root of the
MR parse tree is reached.
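The binding rule can be sketched as a bottom-up pass over the parse tree of Figure 4.5. The sketch below is an illustration under stated assumptions, not the thesis code: the class `Prod`, the functions `own_vars`, `subtree_nodes` and `bind`, and the use of a regular expression to find variables named x1, x2, ... in a production's text are all hypothetical. A variable is bound at a node if it appears in the production (including the argument lists added for its children) and also in some production outside the sub-parse rooted at that node.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Prod:
    text: str                                   # RHS of the MRL production
    children: list = field(default_factory=list)
    bound: list = field(default_factory=list)   # filled in by bind()


def own_vars(node):
    """Logical variables written in the production itself."""
    return set(re.findall(r"x\d+", node.text))


def subtree_nodes(node):
    yield node
    for child in node.children:
        yield from subtree_nodes(child)


def bind(root):
    """Decide, bottom-up, which variables each production must bind."""
    all_nodes = list(subtree_nodes(root))

    def visit(node):
        beta = own_vars(node)
        for child in node.children:
            visit(child)
            beta |= set(child.bound)   # argument list added after the child
        inside = {id(n) for n in subtree_nodes(node)}
        outside = set()
        for n in all_nodes:
            if id(n) not in inside:
                outside |= own_vars(n)
        node.bound = sorted(beta & outside, key=lambda v: int(v[1:]))

    visit(root)
    return root


# The chain of productions in Figure 4.5, from the root downwards.
texts = ["answer(x1,FORM)", "count(x2,(FORM),x1)", "state(x2),FORM",
         "next_to(x2,x3),FORM", "most(x3,x4,(FORM))", "state(x3),FORM",
         "next_to(x3,x4),FORM", "state(x4)"]
nodes = [Prod(t) for t in texts]
for parent, child in zip(nodes, nodes[1:]):
    parent.children.append(child)

bind(nodes[0])
for n in nodes:
    print("".join(f"λ{v}." for v in n.bound) + n.text)
```

On this example the bound variables agree with the λ-operators shown in Figure 4.7; for instance, x4 is not bound at the most production because it does not occur outside that sub-parse.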

Once transformed parse trees are obtained for all logical forms in the training set,
lexical acquisition proceeds as follows: Train a word alignment model, M, and a reverse
word alignment model, M̃, using the training set, {⟨e_i, f'_i⟩}, where f'_i are the
transformed MR parse trees. During the training of M and M̃, all lambda abstractions and
variable names in logical forms are ignored to reduce sparsity. Obtain the k-best
alignments, a*_1, ..., a*_k, and the best reverse alignment, ã*, for each training example
⟨e_i, f'_i⟩ using M and M̃. Remove bad links from each a*_k, and replenish the removed
links by adding links from ã* (Section 3.2.3). Then extract λ-SCFG rules from
a*_1, ..., a*_k as described in the ACQUIRE-LEXICON procedure (Figure 3.6, lines 7-13),
while merging nodes in the MR parse tree if necessary (Section 3.2.2). The extracted
λ-functions can be normalized through renaming of logical variables, using a procedure
commonly known as α-conversion (Blackburn and Bos, 2005).

NL words:
How many states border the state that borders the most states ?

MRL productions:
Query → answer(x1, Form)
Form → count(x2, (Form), x1)
Form → state(x2), Form
Form → next_to(x2, x3), Form
Form → most(x3, x4, (Form))
Form → state(x3), Form
Form → next_to(x3, x4), Form
Form → state(x4)

Figure 4.6: A word alignment based on Figures 4.4 and 4.5

Query
  answer(x1, Form(x1))
    λx1.count(x2, (Form(x2)), x1)
      λx2.state(x2), Form(x2)
        λx2.next_to(x2, x3), Form(x3)
          λx3.most(x3, x4, (Form(x3, x4)))
            λx3.λx4.state(x3), Form(x3, x4)
              λx3.λx4.next_to(x3, x4), Form(x4)
                λx4.state(x4)

Figure 4.7: A parse tree for the logical form in Figure 4.4 with λ-operators

NL words:
How many states border the state that borders the most states ?

MRL productions:
Query → answer(x1, Form(x1))
Form → λx1.count(x2, (Form(x2)), x1)
Form → λx2.state(x2), Form(x2)
Form → λx2.next_to(x2, x3), Form(x3)
Form → λx3.most(x3, x4, (Form(x3, x4)))
Form → λx3.λx4.state(x3), Form(x3, x4)
Form → λx3.λx4.next_to(x3, x4), Form(x4)
Form → λx4.state(x4)

Figure 4.8: A word alignment based on Figures 4.4 and 4.7

4.2.3 Probabilistic Model

Since a learned λ-SCFG can be ambiguous, a probabilistic model is needed for parse
disambiguation. In λ-Wasp, we use the same log-linear model as Wasp (Section 3.2.4). In
summary, the log-linear model defines a conditional probability distribution over
derivations given an input NL sentence, e:

\[ \Pr\nolimits_{\lambda}(\mathbf{d} \mid \mathbf{e}) = \frac{1}{Z_{\lambda}(\mathbf{e})} \exp \sum_i \lambda_i f_i(\mathbf{d}) \tag{4.3} \]

The output logical form, f*, is the yield of the most probable derivation consistent with
the input sentence, which can be computed in cubic time with respect to the sentence
length:

\[ \mathbf{f}^{\star} = \mathbf{f}\left( \arg\max_{\mathbf{d} \in D(G \mid \mathbf{e})} \exp \sum_i \lambda_i f_i(\mathbf{d}) \right) \tag{4.4} \]

The following feature types are used in the log-linear model:

• For each λ-SCFG rule r, there is a feature, f_r, that returns the number of times r is
used in a derivation.

• For each NL word w, there is a feature, f_w, that returns the number of times w is
generated from word gaps in a derivation.

• Generation of previously unseen words is modeled using an extra feature, f_*, that
returns the total number of words generated from word gaps in a derivation.

Additional language-modeling features specific to λ-Wasp will be introduced in
Section 4.2.5.
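Equations (4.3) and (4.4) and the count-based features can be illustrated on an explicitly enumerated set of candidate derivations. This is only a sketch under strong assumptions: the actual parser searches the derivation space with a chart rather than enumerating it, and the feature names, weights and the helpers `score`, `conditional_probs` and `best` below are made up for the example.

```python
import math
from collections import Counter


def score(features: Counter, weights: dict) -> float:
    """Weighted sum of feature counts, i.e. the exponent in (4.3)."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def conditional_probs(derivations, weights):
    """Pr(d | e) for each candidate derivation d, normalized by Z(e)."""
    scores = [score(f, weights) for f in derivations]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def best(derivations, weights) -> int:
    """Index of the most probable derivation, as in (4.4)."""
    return max(range(len(derivations)), key=lambda i: score(derivations[i], weights))


# Hypothetical feature counts: rule features, per-word gap features, and the
# extra feature "gap:*" that counts all words generated from word gaps.
derivations = [
    Counter({"rule:Form->state": 1, "gap:the": 1, "gap:*": 1}),
    Counter({"rule:Form->states": 1}),
]
weights = {"rule:Form->state": 1.0, "rule:Form->states": 0.2,
           "gap:the": -0.2, "gap:*": -0.3}

probs = conditional_probs(derivations, weights)
print(probs, best(derivations, weights))
```

Since the exponential is monotonic, the arg max in (4.4) only needs the weighted sums, so the normalizer Z(e) is not required for decoding.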

The model parameters, λ, are estimated by maximizing the conditional log-likelihood of the
training set.2 Details of the parameter estimation algorithm can be found in
Section 3.2.4.


4.2.4 Promoting Parse Tree Isomorphism

In the previous sections, we have described the λ-Wasp algorithm in which logical forms
are produced using the lambda calculus. While reasonably effective, it can be further
improved in several ways. In this section, we focus on improving lexical acquisition.

To see why the current lexical acquisition algorithm can be problematic, consider the
following λ-SCFG rules which would be extracted from the word alignment in Figure 4.8:

Form → ⟨ states , λx4.state(x4) ⟩
Form → ⟨ state (1) borders (1) most Form[1] ,
         λx3.most(x3, x4, (state(x3), next_to(x3, x4), Form[1](x4))) ⟩
Form → ⟨ border (1) Form[1] , λx2.next_to(x2, x3), Form[1](x3) ⟩
Form → ⟨ states Form[1] , λx2.state(x2), Form[1](x2) ⟩
Form → ⟨ how many Form[1] , λx1.count(x2, (Form[1](x2)), x1) ⟩
Query → ⟨ Form[1] (1) , answer(x1, Form[1](x1)) ⟩

The second rule is based on the combination of three MRL productions. These productions
are combined because no rules can be extracted for the production
Form → λx3.λx4.next_to(...). This is because the shortest NL substring that covers the
word "borders" and the argument string "states", i.e. "borders the most states", contains
the word "most", which is linked to an MRL production (most) that is not a descendant of
Form → λx3.λx4.next_to(...) in the MR parse tree. Rule extraction is forbidden in this
case because it would destroy the link between the word "most" and the production for
most. The same holds for the production Form → λx3.λx4.state(...). These two productions
are combined with the production for the most predicate through node merging
(Section 3.2.2). Since excessive node merging can lead to rules that are too specific,
causing overfitting, it is desirable to have NL and MR parse trees that are isomorphic, or
close to isomorphic.

2 While the use of the symbol λ for log-linear parameters coincides with the use of λ for
variable-binding operators, the meaning of λ should be clear from the context.

As mentioned in Section 3.4, several researchers have proposed syntax-aware
word alignment models to promote tree isomorphism. Here we use a different
approach: change the shape of an MR parse tree so that the NL and MR
parse trees are maximally isomorphic. This is possible because the conjunction
operator (,) used in predicate logic is both associative (a, (b, c) = (a, b), c
= a, b, c) and commutative (a, b = b, a).³ Hence, conjuncts can be reordered
and regrouped without changing the meaning of a conjunction. Such conjunct
reordering and regrouping changes the shape of an MR parse tree. For example, rule
extraction would be possible if the MR sub-parse for the formula most(...) is
the one shown in Figure 4.9.

We present a method for regrouping conjuncts to promote isomorphism between
NL and MR parse trees. It requires a word alignment as input. This regrouping
is done before λ-operators are added (Section 4.2.2). Given a conjunction, the
method does the following:

³ While our discussion focuses on the conjunction operator, it also applies to other operators that
are associative and commutative, e.g. disjunction.

FORM
  λx3.most(x3, x4, (FORM(x3, x4), FORM(x4)))
    λx3.λx4.next_to(x3, x4), FORM(x3)
      λx3.state(x3)
    λx4.state(x4)

Figure 4.9: An alternative sub-parse for the logical form in Figure 4.4

Step 1. Identify the MRL productions that correspond to the conjuncts and the
predicate that takes the conjunction as an argument, and represent them as vertices in
an undirected graph, Γ. For example, in this MR parse tree (children are indented
under their parent):

FORM
  most(x3, x4, (FORM))
    state(x3), FORM
      next_to(x3, x4), FORM
        state(x4)


the productions that correspond to the conjuncts are:

FORM → state(x3)
FORM → next_to(x3, x4)
FORM → state(x4)

and the production that corresponds to the predicate that takes the conjunction as
an argument is:

FORM → most(x3, x4, FORM)

Each of these productions, denoted by p_i, is represented as a vertex in the undirected
graph Γ. For convenience, the production that corresponds to the predicate that
takes the conjunction as an argument has a special name, p_0.

Step 2. Add an edge (p_i, p_j) to Γ if there exists a logical variable x that appears in
the RHS of both p_i and p_j. For example, Γ would look like this:

Vertices: FORM → most(x3, x4, FORM); FORM → state(x3);
          FORM → next_to(x3, x4); FORM → state(x4)
Edges:    (most, state(x3)); (most, next_to); (most, state(x4));
          (state(x3), next_to); (next_to, state(x4))

Each edge in Γ indicates a possible edge in the rearranged MR parse tree. Intuitively,
two concepts are closely related only if they involve the same logical variables,
and closely-related concepts should be placed close together in the MR parse
tree. By keeping occurrences of a logical variable in close proximity in the MR
parse tree, we also avoid unnecessary variable bindings in the extracted rules.

Step 3. Let s(i, j) be the shortest NL substring that contains all the words that are
linked to p_i and p_j in the input word alignment. If i, j ≠ 0 and s(i, j) contains a
word that is linked to p_0, then remove the edge (p_i, p_j) from Γ. For example, Γ
would look like this given the word alignment in Figure 4.6:

Vertices: FORM → most(x3, x4, FORM); FORM → state(x3);
          FORM → next_to(x3, x4); FORM → state(x4)
Edges:    (most, state(x3)); (most, next_to); (most, state(x4));
          (state(x3), next_to)

An edge is removed because the shortest NL substring that contains all the words
that are linked to FORM → next_to(x3, x4) and FORM → state(x4), i.e. borders
the most states, contains the word most, which is linked to the production
FORM → most(x3, x4, FORM). Since FORM → most(x3, x4, FORM) is going
to be the root of the rearranged MR parse tree, an edge between FORM →
next_to(x3, x4) and FORM → state(x4) would prevent a λ-SCFG rule from
being extracted for either the next_to or state production.

Step 4. To make sure that Γ is a connected graph, add an edge (p_0, p_i) to Γ if p_i is
not already connected to p_0 in Γ.

Step 5. Assign edge weights based on word distance. The weight of an edge (p_i, p_j)
is defined as the minimum distance between the words that are linked to p_i and p_j.
For example, the edge weights for Γ given the word alignment in Figure 4.6 would
be:

[Graph Γ from Step 3, with each edge labeled by its weight]

The weight of the edge between FORM → most(x3, x4, FORM) and FORM →
state(x3) is 4 because the words most and state are 4 words apart in the sentence.
The other edge weights are assigned in a similar way.

Step 6. Find a minimum spanning tree, T, for Γ. T exists because Γ is a connected
graph (see Step 4). T can be found using Kruskal's algorithm (Cormen et al., 2001).
Conjuncts will be regrouped based on T. For example, for the weighted graph Γ
shown above, the minimum spanning tree would be:

[Tree T with edges (most, next_to); (next_to, state(x3)); (most, state(x4))]

Conjuncts would be regrouped such that there is an edge in the rearranged MR parse
tree between FORM → most(x3, x4, FORM) and FORM → next_to(x3, x4),
and so on. The choice of T reflects the intuition that words that occur close together
in a sentence tend to be semantically related.

Step  7.  Finally,  using  p0  as  the  root,  construct  a  new  MR  parse  tree  based  on  T. 
Add  conjunction  operators  to  the  productions  as  necessary. 

In  summary,  conjuncts  are  regrouped  such  that  concepts  that  are  related  are 
placed  close  together  in  the  MR  parse  tree  (Steps  2,  5  and  6).  Also  the  NL  and  MR 
parse  trees  should  be  isomorphic  if  possible  (Step  3).  This  procedure  is  repeated 
for  all  conjunctions  that  appear  in  a  logical  form. 
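
The seven steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in our experiments: the data layout, the function names, and the word positions are hypothetical. The positions assume the phrase "the state that borders the most states" with one linked word per production, which is consistent with the distance of 4 between most and state quoted in Step 5 but is not copied from Figure 4.6. The sketch also reads "connected" in Step 4 as "reachable by a path" and assumes that every production is linked to at least one word.

```python
from itertools import combinations

# (production, logical variables on its RHS, positions of linked words).
# Index 0 is p0. Positions assume the hypothetical word sequence
#   0:the 1:state 2:that 3:borders 4:the 5:most 6:states
PRODUCTIONS = [
    ("most(x3,x4,FORM)", {"x3", "x4"}, [5]),   # p0
    ("state(x3)", {"x3"}, [1]),
    ("next_to(x3,x4)", {"x3", "x4"}, [3]),
    ("state(x4)", {"x4"}, [6]),
]


def neighbours(u, edges):
    """Vertices adjacent to u in a collection of undirected edges."""
    return [b if a == u else a for a, b in edges if u in (a, b)]


def reachable_from_p0(edges):
    seen, stack = {0}, [0]
    while stack:
        for v in neighbours(stack.pop(), edges):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def build_graph(prods):
    """Steps 2-5: return {(i, j): weight} with i < j."""
    n = len(prods)
    # Step 2: connect productions that share a logical variable.
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if prods[i][1] & prods[j][1]}
    # Step 3: for i, j != 0, drop the edge if the shortest covering
    # substring contains a word linked to p0.
    p0_words = set(prods[0][2])
    for i, j in sorted(edges):
        if i == 0:
            continue
        linked = prods[i][2] + prods[j][2]
        if p0_words & set(range(min(linked), max(linked) + 1)):
            edges.discard((i, j))
    # Step 4: reconnect any vertex that p0 can no longer reach.
    for i in range(1, n):
        if i not in reachable_from_p0(edges):
            edges.add((0, i))
    # Step 5: weight = minimum distance between linked words.
    return {(i, j): min(abs(a - b) for a in prods[i][2] for b in prods[j][2])
            for i, j in edges}


def kruskal(n, weighted):
    """Step 6: minimum spanning tree via Kruskal's algorithm."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), _ in sorted(weighted.items(), key=lambda kv: (kv[1], kv[0])):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree


def root_tree(n, tree):
    """Step 7: orient the spanning tree with p0 as the root."""
    children = {i: [] for i in range(n)}
    seen, queue = {0}, [0]
    while queue:
        u = queue.pop(0)
        for v in neighbours(u, tree):
            if v not in seen:
                seen.add(v)
                children[u].append(v)
                queue.append(v)
    return children


graph = build_graph(PRODUCTIONS)
mst = kruskal(len(PRODUCTIONS), graph)
children = root_tree(len(PRODUCTIONS), mst)
```

Under these assumptions, the rooted tree places next_to and state(x4) directly under most, and state(x3) under next_to, which is the shape of the sub-parse in Figure 4.9.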

Lexical acquisition then proceeds as described in Section 4.2.2, using the
same word alignments used for conjunct regrouping. Figure 4.9 shows the rearranged
MR parse tree based on the minimum spanning tree shown above, with
λ-operators added. With this MR parse tree, the following λ-SCFG rules would be
extracted:

FORM → ⟨ states , λx4.state(x4) ⟩
FORM → ⟨ state , λx3.state(x3) ⟩
FORM → ⟨ FORM[1] (1) borders , λx3.λx4.next_to(x3, x4), FORM[1](x3) ⟩
FORM → ⟨ FORM[1] (1) most FORM[2] ,
         λx3.most(x3, x4, (FORM[1](x3, x4), FORM[2](x4))) ⟩

These rules are considerably shorter than those shown earlier in this section, and
therefore would generalize better.

Note  that  the  conjunct  regrouping  procedure  requires  a  good  word  alignment 
to  begin  with,  and  this  requires  a  reasonable  ordering  of  conjuncts  in  the  training 
data,  since  the  word  alignment  model  (GIZA++)  is  sensitive  to  word  order.  This 
immediately  suggests  an  iterative  algorithm  in  which  a  better  grouping  of  conjuncts 
leads  to  a  better  alignment  model,  which  is  used  to  guide  further  regrouping  until 
convergence.  We  did  not  pursue  this  direction,  however,  because  in  the  restricted 
domain  we  worked  with,  GIZA++  seemed  to  perform  quite  well  without  re-training. 

4.2.5  Modeling  Logical  Languages 

In this section, we propose two methods for modeling logical languages.
This is motivated by the fact that many of the errors made by the λ-WASP semantic
parser can be detected by inspecting the MRL translations alone. Figure 4.10 shows
some typical errors, which can be classified into two broad categories:

1.  Type  mismatch  errors.  For  example,  a  state  cannot  possibly  be  a  river  (Figure 
4.10(a)).  Also  it  is  awkward  to  talk  about  the  population  density  of  a  state’s 
highest  point  (Figure  4.10(b)). 

2.  Errors  that  do  not  involve  type  mismatch.  For  example,  a  query  can  be  overly 
trivial  (Figure  4.10(c)),  or  involve  aggregate  functions  on  a  known  singleton 
(Figure  4.10(d)). 

(a) answer(x1, largest(x2, (state(x1), major(x1), river(x1),
    traverse(x1, x2))))
    What is the entity that is a state and also a major river, that traverses
    something that is the largest?

(b) answer(x1, smallest(x2, (highest(x1, (place(x1),
    loc(x1, x3), state(x3))), density(x1, x2))))
    Among the highest points of all states, which one has the lowest population
    density?

(c) answer(x1, equal(x1, stateid(alaska)))
    Alaska?

(d) answer(x1, largest(x2, (largest(x1, (state(x1),
    next_to(x1, x3), state(x3))), population(x1, x2))))
    Among the largest state that borders some other state, which is the one with
    the largest population?

Figure 4.10: Typical errors made by λ-WASP with English interpretations

The first type of errors can be fixed by type checking. Each m-place predicate
is associated with a list of m-tuples showing all valid combinations of entity
types that the m arguments can denote:

point(_) : {(POINT)}
density(_,_) : {(COUNTRY, NUM), (STATE, NUM), (CITY, NUM)}

These m-tuples of entity types are given as domain knowledge.⁴ The parser maintains
a set of possible entity types for each logical variable introduced in a partial
derivation (except those that are no longer visible). If there is a logical variable
that cannot denote any type of entity (i.e. its set of entity types is empty),
then the partial derivation is considered invalid. For example, based on the tuples
shown above, point(x1) and density(x1, _) cannot both be true, because
{POINT} ∩ {COUNTRY, STATE, CITY} = ∅. Type checking exploits
the fact that people tend not to ask questions that obviously have no valid answers
(Grice, 1975). It is also similar to Schuler's (2003) use of model-theoretic
interpretations to guide syntactic parsing.

⁴ Note that the same entity type information is encoded in the non-terminal symbols in FunQL
(Appendix A.2), so this is not additional domain knowledge compared to what is used in WASP.
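
The type check itself amounts to intersecting sets of entity types. The following Python sketch illustrates the idea under simplifying assumptions: argument positions are treated independently (the joint m-tuples are not tracked), variable visibility is ignored, and the `state` signature as well as all function and variable names are hypothetical additions for the example.

```python
# Valid argument-type tuples per predicate. `point` and `density` follow the
# tuples listed in the text; `state` is a hypothetical entry for the demo.
SIGNATURES = {
    "point": [("POINT",)],
    "density": [("COUNTRY", "NUM"), ("STATE", "NUM"), ("CITY", "NUM")],
    "state": [("STATE",)],
}


def check_types(atoms):
    """Return {variable: possible entity types}, or None if some variable
    cannot denote any type of entity (the partial derivation is invalid).

    atoms -- list of (predicate, tuple of variable names)
    """
    types = {}
    for pred, args in atoms:
        for pos, var in enumerate(args):
            allowed = {sig[pos] for sig in SIGNATURES[pred]}
            # Intersect with what earlier atoms allow for this variable.
            types[var] = types.get(var, allowed) & allowed
            if not types[var]:
                return None
    return types


invalid = check_types([("point", ("x1",)), ("density", ("x1", "x2"))])
valid = check_types([("state", ("x1",)), ("density", ("x1", "x2"))])
```

Here `invalid` is `None` because no entity type is left for x1, whereas `valid` narrows x1 to STATE and x2 to NUM.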

Errors that do not involve type mismatch are handled by adding new features
to the log-linear model (Section 4.2.3). We only consider features that are based
on the MRL translations, and therefore, these features can be seen as an implicit
language model of the target MRL (Papineni et al., 1997). Of the many feature
types that we have tried, one feature type stands out as being the most effective,
namely the two-level rules in Collins and Koo (2005), which give the number of
times a given rule is used to rewrite a non-terminal in a given parent rule. We use
only the MRL part of the rules. For example, a negative weight for the combination
of QUERY → answer(x1, FORM(x1)) and FORM → λx1.equal(...) would
discourage any parse that yields Figure 4.10(c). The two-level-rules features, along
with the features described in Section 4.2.3, are used in the final version of λ-WASP.
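
As an illustration of how two-level-rules features could be counted, the following Python sketch walks a derivation tree and counts each (parent rule, child rule) pair. The tree encoding, the rule strings, and the weight value are hypothetical and serve only to make the example concrete.

```python
from collections import Counter


def two_level_features(tree):
    """Count (parent rule, child rule) pairs in a derivation tree.

    tree -- (rule, [child trees]); only the MRL part of each rule is kept.
    """
    rule, children = tree
    counts = Counter()
    for child in children:
        counts[(rule, child[0])] += 1
        counts.update(two_level_features(child))
    return counts


def feature_score(tree, weights):
    """Contribution of the two-level-rules features to the log-linear score."""
    return sum(weights.get(feat, 0.0) * n
               for feat, n in two_level_features(tree).items())


# Derivation for Figure 4.10(c); the rule strings are illustrative only.
QUERY_RULE = "QUERY -> answer(x1, FORM(x1))"
EQUAL_RULE = "FORM -> λx1.equal(...)"
trivial_query = (QUERY_RULE, [(EQUAL_RULE, [])])

# A hypothetical learned weight that penalizes the trivial query.
weights = {(QUERY_RULE, EQUAL_RULE): -2.0}
penalty = feature_score(trivial_query, weights)
```

With this weight, any derivation in which the equal production directly rewrites the FORM of the answer production is penalized by 2.0 in the log-linear score.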

4.3  Experiments 

In this section, we describe our experiments on λ-WASP and analyze the
experimental results.

4.3.1  Data  Sets  and  Methodology 

We evaluated λ-WASP in the Geoquery domain, using the same data set
that we used for evaluating WASP (Section 3.3.1). We used the original Prolog
logical forms, and Table 4.1 shows the corpus statistics.

We performed standard 10-fold cross validation in our experiments, using
precision, recall and F-measure as the evaluation metrics (Equations 3.7-3.9). We
supplied the same set of initial rules to the learned semantic parsers as described in
Section 3.3.2. These initial rules represent knowledge needed for translating basic
domain entities, such as city names and river names.

                     Geoquery
MRL                  Prolog
# non-terminals      14
# productions        50
# examples           250                                         880
NL                   English   Spanish   Japanese   Turkish     English
Avg. sent. length    6.87      7.39      9.11       5.76        7.57
# unique words       165       159       158        220         280

Table 4.1: Corpora used for evaluating λ-WASP

4.3.2  Results  and  Discussion 

Table 4.2 shows the performance of λ-WASP on the Geoquery 880 data
set with full training data, compared to WASP and three other algorithms:

•  KRISP (Kate and Mooney, 2006), an SVM-based parser using string kernels.

•  SCISSOR (Ge and Mooney, 2006), a combined syntactic-semantic parser with
discriminative reranking.

•  Zettlemoyer and Collins (2007) (abbreviated as ZC07), a probabilistic parser
based on relaxed CCG grammars.

We restrict our comparison to these three algorithms because they were shown to
outperform WASP in the Geoquery domain in Section 3.3.3. Both λ-WASP and
ZC07 use Prolog logical forms as the target MRL. The other systems, WASP, KRISP
and SCISSOR, use the functional query language FunQL developed for the Geoquery
domain (Appendix A.2). The best-performing systems are shown in bold
in Table 4.2.⁵ Figure 4.11 shows the precision and recall learning curves.

Geoquery (880 data set)    Prec. (%)   Rec. (%)   F (%)
λ-WASP                     92.0        86.6       89.2
WASP                       87.2        74.8       80.5
KRISP                      93.3        71.7       81.1
SCISSOR                    95.5        77.2       85.4
ZC07                       95.5        83.2       88.9

Table 4.2: Performance of λ-WASP on the Geoquery 880 data set

A few observations can be made. First, algorithms that use Prolog logical
forms as the target MRL generally show better recall than those using FunQL. In
particular, λ-WASP has the best recall among all systems. The main reason is that
λ-WASP allows lexical items to be combined in ways not allowed by FunQL or the
hand-written template rules in ZC07. For example, under FunQL and ZC07, it is
impossible to combine the most predicate with its arguments as illustrated in
Figure 4.9. Nor is it possible to combine the smallest predicate with its arguments
as illustrated in Figure 4.3(b). These examples show that λ-WASP is more flexible
and can handle a wider variety of logical forms than previous approaches. Despite
its slightly lower precision compared to KRISP, SCISSOR and ZC07, λ-WASP has
the best F-measure overall in the Geoquery domain.
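
As a quick consistency check, the F-measure column of Table 4.2 can be recomputed from the precision and recall columns. The sketch below assumes that the F-measure of Equation 3.9 is the usual harmonic mean of precision and recall; the function name `f_measure` is ours.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) for each system in Table 4.2.
TABLE_4_2 = {
    "λ-WASP": (92.0, 86.6, 89.2),
    "WASP": (87.2, 74.8, 80.5),
    "KRISP": (93.3, 71.7, 81.1),
    "SCISSOR": (95.5, 77.2, 85.4),
    "ZC07": (95.5, 83.2, 88.9),
}

recomputed = {name: round(f_measure(p, r), 1)
              for name, (p, r, _) in TABLE_4_2.items()}
```

Under this assumption, the recomputed values agree with the reported F column to one decimal place.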

⁵ As with Table 3.2, no statistical test was performed.

To see the relative importance of each component of the λ-WASP algorithm,
we performed two ablation studies. First, we compared the performance of λ-WASP
with and without conjunct regrouping (Section 4.2.4). Second, we compared the
performance of λ-WASP with and without language modeling for the target logical
language (Section 4.2.5). Table 4.3 shows the results on the Geoquery 880 data
set. Using paired t-tests to determine statistical significance, we found that conjunct
regrouping improves recall significantly (p < 0.01), and the use of two-level-rules
features in the probabilistic model improves precision and recall (p < 0.05).
Type checking also significantly improves precision and recall (p < 0.001). The
best-performing systems, as well as those systems whose performance shows no
significant difference, are shown in bold in Table 4.3.

[Plots omitted: (a) Precision, (b) Recall]

Figure 4.11: Learning curves for λ-WASP on the Geoquery 880 data set

Geo 880 (%)               Prec.    Rec.
λ-WASP                    91.95    86.59
w/o conj. regrouping      90.73    83.07

Geo 880 (%)               Prec.    Rec.
λ-WASP                    91.95    86.59
w/o two-level rules       88.46    84.32
and w/o type checking     65.45    63.18

Table 4.3: Performance of λ-WASP with different components removed

A major advantage of λ-WASP over SCISSOR and ZC07 is that it does not
require any prior knowledge of the NL syntax. Hence it is straightforward to apply
λ-WASP to other NLs for which training data is available. Table 4.4 shows the
performance of λ-WASP on the multilingual Geoquery data set. It shows that
λ-WASP performed comparably for all four NLs being considered: English, Spanish,
Japanese and Turkish. It achieved the same level of precision as WASP (differences
are not statistically significant based on paired t-tests). For Spanish and Japanese,
λ-WASP has better recall and F-measure than WASP (p < 0.05). Figure 4.12 shows
the precision and recall learning curves for λ-WASP on the multilingual Geoquery
data set.

[Plots omitted]

Figure 4.12: Learning curves for λ-WASP on the multilingual Geoquery data set

            λ-WASP                              WASP
            Prec. (%)   Rec. (%)   F (%)        Prec. (%)   Rec. (%)   F (%)
English     91.76       75.60      82.90        95.42       70.00      80.76
Spanish     92.48       80.00      85.79        91.99       72.40      81.03
Japanese    90.99       81.20      85.82        91.98       74.40      82.86
Turkish     90.36       68.80      78.12        96.96       62.40      75.93

Table 4.4: Performance of λ-WASP on the multilingual Geoquery data set


4.4  Chapter  Summary 

In this chapter, we described the λ-WASP semantic parsing algorithm, an
extended version of WASP which handles MRLs containing logical variables, such
as predicate logic. Underlying λ-WASP is the λ-SCFG formalism, which generates
logical forms using the lambda calculus. We described a learning algorithm similar
to WASP, whose output is a λ-SCFG, together with parameters that define a
log-linear distribution over parses. We further refined the learning algorithm through
transformation of logical forms and language modeling for target MRLs. Using
the same amount of supervision, λ-WASP significantly outperforms WASP, and is
currently one of the best semantic parsing algorithms in the Geoquery domain.

Chapter 5


Natural Language Generation with Machine Translation


This chapter explores a different task from semantic parsing, namely natural
language generation. We focus on the sub-task of tactical generation, in which
statements written in a formal MRL are mapped into NL sentences (Wong and Mooney,
2007a). We show that an effective tactical generation system can be obtained by
inverting the WASP semantic parsing algorithm (Chapter 3). Our approach allows the
same learned synchronous grammar to be used for both parsing and generation. In
this chapter, we consider variable-free MRLs such as FunQL and CLANG (Section
2.1). Generation from logical forms will be discussed in Chapter 6.

5.1  Motivation 

Traditionally, there are several NLP tasks that involve the generation of NL
sentences, e.g. natural language generation, machine translation, text summarization,
and dialog systems. The goal of natural language generation (NLG) is to produce
NL sentences from computer-internal representations of information. NLG
can be divided into two sub-tasks: (1) strategic generation, which decides what
meanings to express, and (2) tactical generation, which generates NL sentences
that express those meanings. This chapter is concerned with the latter task of tactical
generation. In this work, we assume that statements written in a formal MRL,
produced by an external content planner, are given to a tactical generator as input.

As with NLG, the task of machine translation (MT) involves the generation
of NL sentences. NLG is mainly associated with MT in the context of interlingual
and transfer-based MT, where an NLG component is used to generate the target
language from abstract meaning representations (Wilks, 1973; Nyberg and Mitamura,
1992; Gao et al., 2006). Despite their similar goals, there has been little, if
any, research on exploiting recent MT methods for NLG. Specifically, it is easy to
use statistical MT to construct a tactical generator, given a corpus of NL sentences
coupled with their MRs. In this chapter, we present results on using a recent
phrase-based statistical MT system, Pharaoh (Koehn et al., 2003), for NLG. Although
moderately effective, the inability of Pharaoh to exploit the formal structure and
grammar of the MRL limits its accuracy. Unlike natural languages, MRLs typically
have a simple, formal syntax to support effective automated processing and
inference. This MRL structure can also be used to improve language generation.

Tactical generation can also be seen as the inverse of semantic parsing. In
this chapter, we show how to invert the WASP semantic parsing algorithm to produce
a more effective generation system. As shown in Chapter 3, WASP exploits the
formal syntax of the MRL by learning a translator based on an SCFG that maps an
NL sentence to an MR parse tree rather than to a flat MR string. In addition to
exploiting the formal MRL grammar, our approach also allows the same learned grammar
to be used for both parsing and generation, an elegant property that has been
widely advocated (Section 2.3.1). We call our new generation algorithm WASP⁻¹.

While reasonably effective, both Pharaoh and WASP⁻¹ can be substantially
improved by borrowing ideas from each other. In subsequent sections, we
show how the idea of generating from MR parse trees rather than flat MRs, used
effectively in WASP⁻¹, can also be exploited in Pharaoh. A version of Pharaoh
that exploits this approach is experimentally shown to produce more accurate
generators that are more competitive with WASP⁻¹'s. We also show how aspects of
Pharaoh's phrase-based model can be used to improve WASP⁻¹, resulting in a
hybrid system whose performance is the best.

Overall, we show that effective tactical generation systems can be obtained
by exploiting statistical MT methods. This is achieved by treating tactical generation
as a language translation task in which formal MRs are translated into NL
sentences. Furthermore, we show that tactical generation can be formalized as
synchronous parsing (Section 2.4), as is the case with semantic parsing and MT.

5.2  Generation  with  Statistical  Machine  Translation 

In this section, we show how statistical MT methods can be used to construct
tactical generators. We first describe a tactical generation algorithm based on
Pharaoh, a phrase-based statistical MT system (Section 5.2.1). Then we introduce
WASP⁻¹ (Section 5.2.2), a tactical generation algorithm which is the inverse
of the WASP semantic parsing algorithm.

We  consider  source  MRLs  that  are  variable-free.  We  also  assume  that  the 
order  in  which  MR  symbols  appear  is  relevant,  i.e.  the  order  can  affect  the  meaning 
of  the  MR.  Note  that  the  order  in  which  MR  symbols  appear  need  not  be  the  same 
as  the  word  order  of  the  target  NL,  and  therefore,  the  content  planner  need  not  know 
about  the  target  NL  grammar  (Shieber,  1993). 

To ground our discussion, we consider two domains previously used to test
WASP's semantic parsing ability, namely Geoquery and RoboCup (Section 2.1).
In the Geoquery domain, the task is to translate formal queries into NL queries.
Figure 5.1(a) shows a sample formal query and its English translation. In the
RoboCup domain, the task is to translate formal advice given to soccer-playing
agents into English. Figure 5.1(b) shows a piece of sample advice and its English
translation. Such generation systems can be useful in the parse disambiguation
scenario: If a semantic parser finds its NL input ambiguous and produces multiple
alternative formal interpretations, the competing interpretations can be paraphrased
back into NL through a tactical generator, so that the user can pick a correct
interpretation based on the NL translations. The chosen interpretation can then be used
for further processing. It can also be used as a new training example to improve the
semantic parser.

answer(state(traverse_1(riverid('ohio'))))
What states does the Ohio run through?
(a) A formal query written in FunQL

((bowner our {4}) (do our {6} (pos (left (half our)))))
If our player 4 has the ball, then our player 6 should stay in the left side of our half.
(b) A piece of formal advice written in CLANG

Figure 5.1: Sample meaning representations and their English glosses

5.2.1  Generation  Using  Pharaoh 

We start with a generation system based on Pharaoh. Pharaoh (Koehn
et al., 2003) is a statistical MT system that uses phrases as basic translation units.
During decoding, the source sentence is segmented into a number of sequences of
consecutive words (or phrases). These phrases are then reordered and translated
into phrases in the target language, which are joined together to form the output
sentence. Compared to earlier word-based methods such as IBM Models (Section
2.5.1), phrase-based methods such as Pharaoh are much more effective in producing
idiomatic translations, and are currently among the best-performing methods in
statistical MT (Koehn and Monz, 2006).

A main component of Pharaoh is a lexicon consisting of bilingual phrase
pairs. These phrase pairs are extracted from a training corpus of sentences coupled
with their translations. Using GIZA++, the best word alignments for each training
example are first obtained (Section 2.5.1). A lexicon is then formed by collecting
all phrase pairs that are consistent with these word alignments.
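
To make the notion of consistency concrete, the following Python sketch collects phrase pairs from a single aligned example. It is a simplified illustration rather than Pharaoh's actual extraction code: a phrase pair is kept only if no word inside the target span is linked to a word outside the source span, and the common extension of phrases over unaligned boundary words is omitted. The tokenization of the MR from Figure 5.1(a), the word alignment, and the `max_len` limit are hypothetical.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Collect phrase pairs consistent with a word alignment.

    src, tgt  -- lists of tokens
    alignment -- set of (source index, target index) links
    """
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            tgt_pos = [t for s, t in alignment if s1 <= s <= s2]
            if not tgt_pos:
                continue  # the source span has no aligned words
            t1, t2 = min(tgt_pos), max(tgt_pos)
            if t2 - t1 >= max_len:
                continue  # the target span is too long
            # Consistency: no target word in [t1, t2] links outside [s1, s2].
            if any(t1 <= t <= t2 and not s1 <= s <= s2
                   for s, t in alignment):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


# Hypothetical tokenization and alignment for the example in Figure 5.1(a).
mr = ["answer", "state", "traverse_1", "riverid", "'ohio'"]
nl = ["what", "states", "does", "the", "ohio", "run", "through"]
links = {(0, 0), (1, 1), (2, 5), (2, 6), (3, 4), (4, 4)}

pairs = extract_phrase_pairs(mr, nl, links)
```

With this alignment, riverid alone is not paired with ohio, because ohio is also linked to 'ohio'; the two MR symbols are extracted together as one phrase instead.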

To discriminate good translations from bad ones, Pharaoh uses a log-linear
model that defines a conditional probability distribution over translations:

\[
\Pr\nolimits_{\lambda}(\mathbf{e} \mid \mathbf{f}) \;\propto\; \Pr(\mathbf{e})^{\lambda_1} \prod_{i=1}^{I} \Big( P(f_i \mid e_i)^{\lambda_2}\, P(e_i \mid f_i)^{\lambda_3}\, P_w(f_i \mid e_i)^{\lambda_4}\, P_w(e_i \mid f_i)^{\lambda_5}\, d(i-1, i)^{\lambda_6}\, \exp(-|e_i|)^{\lambda_7}\, \exp(-1)^{\lambda_8} \Big) \tag{5.1}
\]

where f is an input sentence, and e is a translation of f. Pr(e) is the language model,
and e_i and f_i are the phrases that comprise e and f. P(e|f) and P(f|e) are the relative
frequencies of e and f, and P_w(e|f) and P_w(f|e) are the lexical weights (Koehn
et al., 2003). The distortion model, d(i-1, i), gives the cost of phrase reordering
based on the distance between the (i-1)-th and i-th phrases. Both the word penalty,
exp(-|e|), and the phrase penalty, exp(-1), allow some control over the output
translation length. The model parameters, λ, are trained using minimum error-rate
training (Och, 2003). The output translation, e*, given an input sentence, f, is:

\[
\mathbf{e}^{*} = \arg\max_{\mathbf{e}} \Pr\nolimits_{\lambda}(\mathbf{e} \mid \mathbf{f}) \tag{5.2}
\]

This can be efficiently approximated through beam search.
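
In log space, the model of Equation 5.1 is a weighted sum of log feature values, and Equation 5.2 selects the candidate with the highest sum. The following Python sketch scores fully specified candidates under that reading. It is not Pharaoh's decoder: there is no beam search, the feature names and all numeric values are hypothetical, and the distortion value is assumed to be given for each phrase pair.

```python
import math

# Hypothetical names for the five probability-valued phrase features;
# they are weighted by λ2..λ6 of Equation 5.1, in this order.
PHRASE_FEATURES = ["p_f_given_e", "p_e_given_f", "lex_f_given_e",
                   "lex_e_given_f", "distortion"]


def log_score(lm_prob, phrase_pairs, lam):
    """Unnormalized log of Equation 5.1 for one segmented translation.

    lm_prob      -- language model probability Pr(e) of the whole output
    phrase_pairs -- one dict per phrase pair, holding PHRASE_FEATURES and
                    "target_len" (number of output words in the phrase)
    lam          -- eight weights; lam[0] is λ1, ..., lam[7] is λ8
    """
    total = lam[0] * math.log(lm_prob)
    for pair in phrase_pairs:
        for k, name in enumerate(PHRASE_FEATURES, start=1):
            total += lam[k] * math.log(pair[name])
        total += lam[6] * -pair["target_len"]   # word penalty
        total += lam[7] * -1.0                  # phrase penalty
    return total


def make_pair(target_len):
    """A toy phrase pair with made-up feature values."""
    return dict(p_f_given_e=0.5, p_e_given_f=0.5, lex_f_given_e=0.5,
                lex_e_given_f=0.5, distortion=1.0, target_len=target_len)


lam = [1.0] * 8
one_phrase = [make_pair(2)]
two_phrases = [make_pair(1), make_pair(1)]
candidates = {"one phrase": (0.1, one_phrase),
              "two phrases": (0.1, two_phrases)}

# Equation 5.2 restricted to the listed candidates.
best = max(candidates, key=lambda k: log_score(*candidates[k], lam))
```

With uniform weights, the single-phrase candidate wins because the second candidate pays the phrase features and the phrase penalty twice.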

To use Pharaoh for tactical generation, we simply treat the source MRL
as an NL, so that phrases in the MRL are sequences of consecutive MR symbols.
Figure 5.2 illustrates the generation process. Note that the grammaticality of MRs
is not an issue here, since they are given as input and are guaranteed to be
grammatical.

Figure 5.2: Generation using Pharaoh


5.2.2  WASP⁻¹:  Generation  by  Inverting  WASP

Tactical generation can be seen as the inverse of semantic parsing. In this
section, we show how to invert the WASP semantic parsing algorithm to produce
WASP⁻¹, and use it for tactical generation.

Recall that in WASP, the semantic parsing problem is formulated as a language
translation task, where NL sentences are translated into formal MRs using
an SCFG. Since an SCFG is fully symmetric with respect to both generated strings,
it can also serve as the underlying formalism for generation. Figure 5.3 gives an
overview of the WASP⁻¹ algorithm. The shaded boxes show the components of the
algorithm that are different from WASP (cf. Figure 3.3). Since WASP and WASP⁻¹
share the same grammar, the lexical acquisition component is the same for both
algorithms. However, as we will see shortly, the probabilistic model of WASP⁻¹ is
different from WASP, and as a result, WASP⁻¹ uses a slightly different decoder.

[Diagram omitted]

Figure 5.3: Overview of the WASP⁻¹ tactical generation algorithm

Given an input MR, f, Wasp⁻¹ finds a sentence e that maximizes the conditional
probability Pr(e|f). It is difficult to directly model Pr(e|f), however, because
it has to assign probabilities to output sentences that are not grammatical. There is
no such requirement for semantic parsing with Wasp, because the use of the MRL
grammar ensures the grammaticality of all MRL translations. For generation, it is
often hard to judge the grammaticality of an output sentence due to the inherent
complexity of natural languages.

This motivates the noisy-channel framework for Wasp⁻¹, where Pr(e|f) is
divided into two smaller components that are easier to model:


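$$\mathbf{e}^* = \arg\max_{\mathbf{e}} \Pr(\mathbf{e} \mid \mathbf{f}) = \arg\max_{\mathbf{e}} \Pr(\mathbf{e}) \Pr(\mathbf{f} \mid \mathbf{e}) \qquad (5.3)$$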

In this framework, Pr(e) is the language model, and Pr(f|e) is the translation
model. The generation task is to find an output NL translation, e*, such that (1)
it is a good sentence a priori, and (2) it preserves the meaning of the input MR. For
the language model, we use an n-gram model, which has been found very useful in
ranking candidate generated sentences (Knight and Hatzivassiloglou, 1995; Bangalore
et al., 2000; Langkilde-Geary, 2002). For the translation model, we re-use the
log-linear model of Wasp (Equation 3.4). Hence computing e* means maximizing
the following:


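$$
\begin{aligned}
\max_{\mathbf{e}} \Pr(\mathbf{e}) \Pr(\mathbf{f} \mid \mathbf{e})
&\approx \max_{\mathbf{d} \in D(G|\mathbf{f})} \Pr(\mathbf{e}(\mathbf{d})) \Pr\nolimits_{\lambda}(\mathbf{d} \mid \mathbf{e}(\mathbf{d})) \\
&= \max_{\mathbf{d} \in D(G|\mathbf{f})} \Pr(\mathbf{e}(\mathbf{d})) \cdot \frac{\exp \sum_i \lambda_i f_i(\mathbf{d})}{Z_{\lambda}(\mathbf{e}(\mathbf{d}))}
\end{aligned}
\qquad (5.4)
$$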


where D(G|f) is the set of all derivations that are consistent with f under an SCFG,
G, and e(d) is the output sentence that a derivation d yields. The second line is
due to the assumption that $\Pr(\mathbf{f} \mid \mathbf{e}) = \sum_{\mathbf{d} \in D(G|\mathbf{f})} \Pr(\mathbf{d} \mid \mathbf{e})$ is approximated by the
Viterbi likelihood, $\max_{\mathbf{d} \in D(G|\mathbf{f})} \Pr(\mathbf{d} \mid \mathbf{e})$.
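
The noisy-channel decision rule in Equation 5.3 can be illustrated with a minimal Python sketch. The candidate sentences and all probabilities below are made-up toy values, and the function name `noisy_channel_best` is hypothetical; the sketch only shows how a language-model score and a translation-model score are combined in log space, not how Wasp⁻¹ enumerates derivations.

```python
import math

def noisy_channel_best(candidates, lm_logprob, tm_logprob):
    """Return the candidate e that maximizes log Pr(e) + log Pr(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(e))

# Toy (assumed) log-probabilities for three candidate sentences.
LM = {  # log Pr(e): how good the sentence is a priori
    "our player 4 has the ball": math.log(0.020),
    "our player 4 has ball": math.log(0.002),
    "player our 4 the ball has": math.log(0.0001),
}
TM = {  # log Pr(f|e): how well the sentence preserves the input MR
    "our player 4 has the ball": math.log(0.30),
    "our player 4 has ball": math.log(0.35),
    "player our 4 the ball has": math.log(0.35),
}

best = noisy_channel_best(list(LM), LM.get, TM.get)
print(best)
```

In this toy setting the translation model alone slightly prefers the less fluent candidates, and the language model is what tips the decision toward the fluent sentence.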


Learning under the noisy-channel framework thus involves two steps. First,
a back-off n-gram language model with Good-Turing discounting and no lexical
classes¹ is built from the training sentences using the Srilm toolkit (Stolcke, 2002).
We use n = 2 since higher values seemed to cause overfitting in our experiments.
Then a translation model is trained as described in Section 3.2.
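
As a rough illustration of the first step, the following Python sketch trains a bigram (n = 2) language model and scores sentences with it. It is a simplification under stated assumptions: it uses add-one smoothing in place of the back-off model with Good-Turing discounting that the text describes, it does not use the Srilm toolkit, and the function names and training sentences are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)

def logprob(sentence, model):
    """Add-one smoothed log Pr(e); a stand-in for Good-Turing back-off."""
    contexts, bigrams, v = model
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(toks, toks[1:]))

model = train_bigram(["our player 4 has the ball",
                      "our player 2 has the ball"])
fluent = logprob("our player 4 has the ball", model)
scrambled = logprob("ball the has 4 player our", model)
print(fluent > scrambled)
```

The fluent word order receives the higher score because all of its bigrams were observed in training, which is the property used to rank candidate sentences.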


¹ This is to ensure that the same language model is used in all systems that we tested.
[Figure: the words of the sentence "If our player 4 has the ball" aligned with the MRL
productions Rule → (Condition Directive), Condition → (bowner Team {Unum}),
Team → our, and Unum → 4.]

Figure 5.4: A word alignment between English and CLANG (cf. Figure 3.5)

Compared to most existing work on generation, Wasp⁻¹ has the following
characteristics:

1. It does not require any lexical information in the input MR, so lexical selection
is an integral part of the decoding algorithm.

2. A lexical item may consist of multiple words. Moreover, it can be discontiguous.


The second characteristic is evident when we consider the following SCFG rule,
which can be extracted from the word alignment in Figure 3.5 (reproduced here
as Figure 5.4 for convenience):

Condition → ⟨ Team₁ player Unum₂ has (1) ball , (bowner Team₁ {Unum₂}) ⟩


In this SCFG rule, the NL string contains a sequence of non-consecutive words, as
in "our player 4 has the ball", where the word gap (1) stands for the unaligned
word "the". This lexical item is therefore discontiguous.
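
To make the notion of a discontiguous lexical item concrete, the sketch below realizes the NL side of the rule above in Python. The template encoding (the strings `TEAM_1`, `UNUM_2` and `<gap:1>`) and the function `realize` are hypothetical conveniences for illustration, not the data structures used in Wasp⁻¹.

```python
def realize(template, bindings, gap_words):
    """Fill non-terminals and word gaps in an NL template, left to right."""
    words = []
    gaps = iter(gap_words)
    for symbol in template:
        if symbol in bindings:            # non-terminal: substitute its NL yield
            words.extend(bindings[symbol].split())
        elif symbol.startswith("<gap:"):  # word gap with a maximum size
            limit = int(symbol[5:-1])
            filler = next(gaps).split()
            if len(filler) > limit:
                raise ValueError("gap filler longer than the gap")
            words.extend(filler)
        else:                             # terminal word of the lexical item
            words.append(symbol)
    return " ".join(words)

# NL side of: Condition -> < Team_1 player Unum_2 has (1) ball , ... >
template = ["TEAM_1", "player", "UNUM_2", "has", "<gap:1>", "ball"]
bindings = {"TEAM_1": "our", "UNUM_2": "4"}
sentence = realize(template, bindings, ["the"])
print(sentence)
```

The terminals "player", "has" and "ball" belong to a single lexical item even though other material is interleaved between them in the output sentence.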

For decoding, we use an Earley chart generator that scans the input MR
from left to right. This is possible because it is assumed that the order in which
MR symbols appear is fixed, i.e. the order determines the meaning of the MR.²
Hence the chart generator is very similar to the chart parser in Wasp, except for the
following:

1. To facilitate the computation of the language model, chart items now include
a list of (n − 1)-grams that encode the context in which output NL phrases
appear. The size of the list is 2N + 2, where N is the number of non-terminals
to be rewritten in the partial derivation.

2. Words are generated from word gaps through special rules (g) → ⟨α, ∅⟩,
where the word gap, (g), of size g is treated as a non-terminal, and α is the
NL string that fills the gap (|α| ≤ g). The empty set symbol indicates that
the gap filler does not carry any meaning. There are similar constructs in
Carroll et al. (1999) for generating function words. Furthermore, to improve
efficiency, the Wasp⁻¹ generator only considers gap fillers that have been
observed during training.

3. The normalizing factor in (5.4), Z_λ(e(d)), is not a constant and varies across
NL translations, e(d). (Note that Z_λ(e) is constant for semantic parsing because
e is given as input.) This is unfortunate because the calculation of
Z_λ(e(d)) is expensive, and is not easy to incorporate into the chart generation
algorithm. Decoding is thus performed through the following approximation:
First, compute the k-best candidate NL translations based on the
unnormalized version of (5.4), $\Pr(\mathbf{e}(\mathbf{d})) \cdot \exp \sum_i \lambda_i f_i(\mathbf{d})$. Then re-rank the
list by normalizing the scores using Z_λ(e(d)), which is obtained by running
the inside-outside algorithm on each NL translation. This results in a decoding
algorithm that takes cubic time with respect to the length of each of the k
candidate NL translations (k = 100 in our experiments).³

² See Chapter 6, where this assumption no longer holds.
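
The two-pass approximation in item 3 can be sketched as follows. This is only an illustration under stated assumptions: the candidate list, the scores and the `log_z` values are invented, and in the actual system the normalizer would come from running the inside-outside algorithm on each candidate rather than from a lookup.

```python
def rerank(kbest):
    """Re-rank k-best candidates by normalized log score.

    Each candidate is (sentence, unnormalized_log_score, log_z), where
    unnormalized_log_score = log Pr(e(d)) + sum_i lambda_i * f_i(d)
    and log_z = log Z_lambda(e(d)).
    """
    return sorted(kbest, key=lambda c: c[1] - c[2], reverse=True)

# Toy k-best list, already sorted by unnormalized score (assumed values).
kbest = [
    ("candidate A", -2.0, -0.5),   # normalized: -1.5
    ("candidate B", -2.5, -1.5),   # normalized: -1.0
    ("candidate C", -4.0, -1.0),   # normalized: -3.0
]
reranked = rerank(kbest)
print([c[0] for c in reranked])
```

Because the normalizer differs per candidate, the best candidate after normalization need not be the one with the highest unnormalized score, which is why pruning to k candidates before normalization can discard the true optimum.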

5.3  Improving  the  MT-based  Generators 

The MT-based generation algorithms, Pharaoh and Wasp⁻¹, while reasonably
effective, can be substantially improved by borrowing ideas from each
other. This section describes the two resulting hybrid systems, Pharaoh++
(Section 5.3.1) and Wasp⁻¹++ (Section 5.3.2).

5.3.1  Improving  the  PHARAOH-based  Generator 

A major weakness of Pharaoh as an NLG system is its inability to exploit
the formal structure of the MRL. As with Wasp⁻¹, the lexical acquisition algorithm
of Pharaoh is based on the output of a word alignment model such as Giza++,
which performs poorly when applied directly to MRLs due to a large number of
semantically vacuous MR symbols (see Section 3.2.1).

We can improve the Pharaoh-based generator by supplying linearized MR
parse trees as input rather than flat MR strings. As a result, the basic translation
units are sequences of consecutive MRL productions in a linearized MR parse tree
rather than sequences of consecutive symbols in an MR string. The same idea is
used in Wasp⁻¹ to produce high-quality SCFG rules. We call the resulting hybrid
NLG system Pharaoh++. Figure 5.5 illustrates the generation process of
Pharaoh++.

³ This k-best approximation can be avoided by choosing a formulation of Pr(e|f) other than the
noisy channel, e.g. Pr(e(d)) Pr_λ(d|f). The latter probability can be computed using a log-linear
model trained with an optimization criterion similar to Equation 3.6. Also Wu and Wong (1998)
point out that normalization of the translation model may not be necessary when there is a strong
language model. However, our experiments showed that normalization was necessary for Wasp⁻¹
to achieve good performance.

5.3.2  Improving the Wasp⁻¹ Algorithm

There are several aspects of Pharaoh that can be used to improve Wasp⁻¹.
First, the probabilistic model of Wasp⁻¹ is less than ideal as it requires an extra
re-ranking step for normalization, which is expensive and prone to over-pruning. To
remedy this situation, we can borrow the log-linear model of Pharaoh, and define
the conditional probability of a derivation, d, given an input MR string, f, as:

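$$\Pr\nolimits_{\lambda}(\mathbf{d} \mid \mathbf{f}) \propto \Pr(\mathbf{e}(\mathbf{d}))^{\lambda_1} \prod_{d \in \mathbf{d}} w_{\lambda}(r(d)) \qquad (5.5)$$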

where $\prod_{d \in \mathbf{d}} w_{\lambda}(r(d))$ is the product of the weights of the SCFG rules used in a
derivation d. The weight w_λ of an SCFG rule is in turn defined as:

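$$w_{\lambda}(A \rightarrow \langle \alpha, \beta \rangle) = P(\beta \mid \alpha)^{\lambda_2}\, P(\alpha \mid \beta)^{\lambda_3}\, P_w(\beta \mid \alpha)^{\lambda_4}\, P_w(\alpha \mid \beta)^{\lambda_5}\, \exp(-|\alpha|)^{\lambda_6} \qquad (5.6)$$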
where the relative frequencies, P, and lexical weights, P_w, are defined analogously
to Equation 5.1. The word penalty, exp(−|α|), offers a way to control the output
sentence length. The output NL translation, e*, is then the sentence that the most
probable derivation consistent with f yields:


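$$\mathbf{e}^* = \mathbf{e}\Bigl(\arg\max_{\mathbf{d} \in D(G|\mathbf{f})} \Pr\nolimits_{\lambda}(\mathbf{d} \mid \mathbf{f})\Bigr) \qquad (5.7)$$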


An advantage of this formulation of e* is that its computation requires no
normalization and can be done exactly and efficiently. Also, the model parameters λ are
trained such that the Bleu score of the training set is directly maximized (Och,
2003). Bleu is a standard evaluation metric in the MT literature for assessing
sentence fluency (Papineni et al., 2002).⁴ Compared to the maximum conditional
likelihood criterion used in Wasp⁻¹, the maximum Bleu criterion is more strongly
correlated with translation quality.
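
The log-linear score of Equations 5.5 and 5.6 is easiest to compute in log space, where the exponents become multipliers. The following Python sketch assumes the component probabilities of each rule are already available; the function names, argument order and numeric values are hypothetical.

```python
import math

def rule_log_weight(p_b_a, p_a_b, pw_b_a, pw_a_b, nl_len, lam):
    """log w_lambda(A -> <alpha, beta>) following Equation 5.6.

    lam = (lambda_2, ..., lambda_6); nl_len = |alpha|, the NL string length.
    """
    l2, l3, l4, l5, l6 = lam
    return (l2 * math.log(p_b_a) + l3 * math.log(p_a_b)
            + l4 * math.log(pw_b_a) + l5 * math.log(pw_a_b)
            - l6 * nl_len)           # log of exp(-|alpha|)^lambda_6

def derivation_log_score(lm_logprob, rule_log_weights, lam1):
    """Unnormalized log Pr_lambda(d|f) following Equation 5.5."""
    return lam1 * lm_logprob + sum(rule_log_weights)

lam = (1.0, 1.0, 1.0, 1.0, 1.0)
w = rule_log_weight(0.5, 0.5, 0.5, 0.5, 2, lam)
score = derivation_log_score(-3.0, [w, w], 2.0)
print(w, score)
```

Since the argmax in Equation 5.7 only compares derivations for the same input f, the proportionality constant in Equation 5.5 can be ignored, which is why no normalization is needed during decoding.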

Following the phrase extraction algorithm in Pharaoh, we eliminate word
gaps by incorporating unaligned words as part of the extracted NL strings. For
example, given the word alignment in Figure 5.4, the following SCFG rule would be
extracted instead of the one shown in Section 5.2.2, by incorporating the unaligned
word "the" into the NL string:

Condition → ⟨ Team₁ player Unum₂ has the ball , (bowner Team₁ {Unum₂}) ⟩


The reason for eliminating word gaps is that while they are useful in dealing with
unknown phrases during semantic parsing, for generation, using known phrases is
generally preferred because it leads to better fluency. For a similar reason, we also
allow the extraction of SCFG rules that are combinations of shorter SCFG rules.
In other words, the extracted rules are not restricted to the shortest ones that cover
the training set. This is because using known combinations of shorter phrases can
lead to better fluency. For example, given the word alignment in Figure 5.4, rules
would be extracted not only for individual MRL productions such as Team →
our and Unum → 4, but also for combinations of productions such as
Condition → (bowner our {Unum}) and Rule → ((bowner our {Unum})
Directive). In this work, we restrict the number of productions being combined
to be no more than 5.

⁴ See Section 5.4.2 for a more detailed description of Bleu.

The new hybrid system is called Wasp⁻¹++. The main difference between
Pharaoh++ and Wasp⁻¹++ is that while Pharaoh++ only allows contiguous
lexical items, Wasp⁻¹++ also allows discontiguous lexical items. Wasp⁻¹++ is
also similar to the syntax-based MT system of Chiang (2005), which is based on
an SCFG with Pharaoh's probabilistic model. The main difference is that we use
the MRL grammar to constrain rule extraction, so that significantly fewer rules are
extracted, leading to a learned grammar with much less ambiguity.

5.4  Experiments 

This section describes the experiments that were performed to evaluate the
four MT-based NLG systems that we introduced in this chapter, namely Pharaoh,
Wasp⁻¹, Pharaoh++, and Wasp⁻¹++. We first present results from the automatic
evaluation (Section 5.4.2), followed by results from the human evaluation (Section
5.4.3). Then we show the experimental results on a multilingual data set (Section
5.4.4).


5.4.1  Data  Sets 


We evaluated the NLG systems in the Geoquery and RoboCup domains.
The experimental results are based on the same corpora that were used for
evaluating the Wasp semantic parsing algorithm. In summary, the Geoquery corpus
consists of 880 formal queries written in the functional query language FunQL,
along with their English translations. Of these, 250 queries were also annotated with
Spanish, Japanese, and Turkish translations. The average sentence length for the
880-example English data set is 7.57 words. The RoboCup corpus consists of 300
pieces of coaching advice written in CLANG, along with their English translations.
The average sentence length for the 300-example data set is 22.52 words. For the
detailed corpus statistics, please refer to Table 3.1.

For each domain, there is a minimal set of lexical items representing knowledge
needed for translating basic domain entities (Section 3.3.2). For Geoquery,
the domain entities are various place names. For RoboCup, the domain entities
are numbers and identifiers. Lexical items representing these domain entities are
supplied to the MT-based generators as follows. For the Pharaoh-based generators,
these lexical items are appended to the training set as separate sentence pairs,
where each sentence pair corresponds to one domain entity. This method has been
widely used in the statistical MT community for incorporating bilingual dictionaries
as an additional knowledge source (Brown et al., 1993a; Och and Ney, 2000). For
the Wasp-based generators, these lexical items come in the form of SCFG rules,
which are always included in the lexicon.

5.4.2  Automatic  Evaluation 

We performed 4 runs of standard 10-fold cross validation, and measured
the performance of the learned generators using the Bleu score (Papineni et al.,
2002) and the Nist score (Doddington, 2002). Both automatic evaluation metrics
approximate human assessment by comparing candidate translations with reference
translations. Specifically, the Bleu score is the geometric mean of the precision of
n-grams of various lengths, multiplied by a brevity penalty factor, BP, that penalizes
candidate translations shorter than the reference translations:


$$\text{Bleu} = \mathrm{BP} \cdot \exp\Bigl(\frac{1}{N} \sum_{n=1}^{N} \log p_n\Bigr) \qquad (5.8)$$


Here N = 4, and p_n denotes the n-gram precision of candidate translations (i.e. the
proportion of n-grams that they share with the reference translations).⁵ The Nist
score is also based on n-gram co-occurrences, but it weighs more heavily those
n-grams that occur less frequently (and hence are more informative). Also it uses an
alternative brevity penalty factor, BP', that minimizes the impact of small variations
in the length of candidate translations (but penalizes large variations more heavily):


$$\text{Nist} = \mathrm{BP}' \cdot \sum_{n=1}^{N} p'_n \qquad (5.9)$$


Here N = 5, and p'_n denotes the weighted n-gram precision of candidate translations.
Bleu and Nist are standard evaluation metrics in the MT literature (e.g. Koehn
and Monz, 2006; NIST, 2006). Both of them have recently been used for evaluating
NL generators (Langkilde-Geary, 2002; Nakanishi et al., 2005; Belz and Reiter,
2006).
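
As an illustration of how the Bleu score described above is assembled, the following Python sketch computes it for a single candidate against a single reference. It is a simplified, sentence-level version under stated assumptions: it uses clipped n-gram counts and the standard brevity penalty exp(1 − r/c) from Papineni et al. (2002), applies no smoothing, and is not the NIST evaluation script used in the experiments (which aggregates counts over the whole test set).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level Bleu with one reference and no smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_counts.values())
        matched = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        if total == 0 or matched == 0:
            return 0.0               # geometric mean is zero
        log_sum += math.log(matched / total) / max_n
    # Brevity penalty: only candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)

reference = "if our player 4 has the ball"
print(bleu(reference, reference))
print(bleu("our player 4 has the ball", reference))
```

The second call shows the brevity penalty at work: every n-gram of the shorter candidate appears in the reference, so all precisions are 1, yet the score falls below 1 because the candidate is one word shorter.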

⁵ Each candidate translation may correspond to multiple reference translations, in which case the
n-gram precision would increase. In the Geoquery corpus, some sentences are mapped to the same
formal queries, so it is possible to supply multiple reference translations for each test example. We
only used one reference translation per example, however, because n-to-1 mappings are relatively
few, and the NIST MT evaluation script which we used only allows a constant number of reference
translations for all test examples.
              Geoquery 880          RoboCup
              Bleu      Nist        Bleu      Nist
Pharaoh       0.2070    3.1478      0.3247    5.0263
Wasp⁻¹        0.4582    5.9900      0.4357    5.4486
Pharaoh++     0.5354    6.3637      0.4336    5.9185
Wasp⁻¹++      0.5370    6.4808      0.6022    6.8976

Table 5.1: Automatic evaluation results for NL generators on the English corpora


              Geoquery (880 data set)    RoboCup
Pharaoh       0.1 s                      0.7 s
Wasp⁻¹        2.4 s                      49.7 s
Pharaoh++     0.03 s                     0.1 s
Wasp⁻¹++      0.7 s                      8.2 s

Table 5.2: Average time needed for generating one test sentence


Table 5.1 presents the automatic evaluation results. The best-performing
systems for each domain are shown in bold, where paired t-tests are used to
determine statistical significance. Figures 5.6 and 5.7 show the Bleu and Nist learning
curves for Pharaoh++ and Wasp⁻¹++ (based on a single run of 10-fold cross
validation).

A few observations can be made. First, Wasp⁻¹ produced more accurate
NL generators than Pharaoh (p < 0.001). Second, Pharaoh++ significantly
outperformed Pharaoh (p < 0.001). Both observations show the importance of
exploiting the formal structure of the MRL. Third, Wasp⁻¹++ significantly
outperformed Wasp⁻¹ (p < 0.001). Much of the gain came from Pharaoh's probabilistic
model. Decoding was also much faster (Table 5.2), despite exact inference and
a larger grammar due to the extraction of longer SCFG rules.

Note that Wasp⁻¹++ significantly outperformed Pharaoh++ with full training
data in the RoboCup domain (p < 0.001). This is because Wasp⁻¹++ allows
discontiguous lexical items whereas Pharaoh++ does not. Such lexical items are
commonly used in RoboCup for constructions like "players 2, 3, 7 and 5" (Figure
5.8); 26.96% of the lexical items that Wasp⁻¹++ used during testing were
discontiguous. When faced with such cases, Pharaoh++ would consistently omit some
of the words (e.g. "players 2 3 7 5"), or not learn any phrases for those constructions
at all. As a result, given some input MRs, Pharaoh++ would fail to find fluent
NL translations that preserve their meanings (Figure 5.8). On the other hand, for
Geoquery, only 4.47% of the lexical items that Wasp⁻¹++ used during testing
were discontiguous, so the advantage of Wasp⁻¹++ over Pharaoh++ was not as
obvious (p < 0.01 for Nist, no significant difference for Bleu).

Figure 5.6: Learning curves for NL generators on the Geoquery 880 data set

Figure 5.7: Learning curves for NL generators on the RoboCup data set

Reference: If our player 2, 3, 7 or 5 has the ball and the ball is close to our goal
line ...

Pharaoh++: If player 3 has the ball is in 2 5 the ball is in the area near our goal
line ...

Wasp⁻¹++: If players 2, 3, 7 and 5 has the ball and the ball is near our goal line

Figure 5.8: Partial NL generator output in the RoboCup domain

With limited training data, Pharaoh++ outperformed Wasp⁻¹++ for both the
Geoquery and RoboCup domains (Figures 5.6 and 5.7). The reason is twofold.
First, Pharaoh++ learned simpler models than Wasp⁻¹++ by restricting
all lexical items to be contiguous. Second, Pharaoh++ had better coverage than
Wasp⁻¹++ given small training sets, i.e. more test examples received NL translations
under Pharaoh++ than Wasp⁻¹++ (Figure 5.9). This is because previously
unseen MR predicates, left untranslated, are included in the output of Pharaoh++,
ensuring 100% coverage. In contrast, Wasp⁻¹ would fail to produce any NL
translations if there is any previously unseen predicate in an input MR, leading to a high
brevity penalty in the Bleu and Nist scores (especially for Nist). Note that
although Pharaoh++ always generates some output, its output sentences, laden with
MR symbols, are often unintelligible.

Our Bleu scores are not as high as those reported in Langkilde-Geary
(2002) and Nakanishi et al. (2005), which are around 0.7 to 0.9. However, their
work involves the re-generation of automatically parsed text, and the MRs that they
use, which are essentially dependency parses, contain extensive lexical information
of the target NL.

5.4.3  Human  Evaluation 

Automatic evaluation is only an imperfect substitute for human assessment.
While it has been found that Bleu and Nist correlate quite well with human
judgments in evaluating NLG systems (Belz and Reiter, 2006), it is best to support
these figures with human evaluation, which we did on a small scale. We recruited 4
native speakers of English with no previous experience with the Geoquery and
RoboCup domains. Each subject was given the same 20 examples for each domain,
randomly chosen from the test sets. For each example, the subjects were
asked to judge the output of Pharaoh++ and Wasp⁻¹++ in terms of fluency and
adequacy. The fluency score shows how fluent a generated sentence is with no
reference to what meaning it is supposed to convey. The adequacy score shows
how well a generated sentence conveys the meaning of the reference sentence. The
subjects were presented with the reference sentences in order to evaluate adequacy.
They were also presented with the following definition of fluency and adequacy
scores, adapted from Koehn and Monz (2006):
Fluency score    English proficiency    Adequacy score    Meaning conveyed
5                Flawless English       5                 All meaning
4                Good English           4                 Most meaning
3                Non-native English     3                 Some meaning
2                Disfluent English      2                 Little meaning
1                Incomprehensible       1                 No meaning

Figure 5.9: Coverage of NL generators on the English corpora

              Geoquery 880            RoboCup
              Fluency   Adequacy      Fluency   Adequacy
Pharaoh++     4.3       4.7           2.5       2.9
Wasp⁻¹++      4.1       4.7           3.6       4.0

Table 5.3: Human evaluation results for NL generators on the English corpora

For each test example, we computed the average of the 4 human judges' scores.
No score normalization was performed. Then we compared the two systems using
paired t-tests. Table 5.3 shows that Wasp⁻¹++ consistently produced good English
sentences that preserved most of the meaning conveyed by the reference sentences.
It also produced better NL generators than Pharaoh++ in the RoboCup domain
(p < 0.01), which is consistent with the results of automatic evaluation.
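
The comparison procedure above (average the judges' scores per example, then run a paired test over examples) can be sketched in Python as follows. The scores are invented for illustration, the function names are hypothetical, and the sketch stops at the t statistic; obtaining a p-value would additionally require the t distribution with n − 1 degrees of freedom, which the standard library does not provide.

```python
import math
from statistics import mean, stdev

def per_example_means(scores_by_judge):
    """scores_by_judge: one list of per-example scores for each judge."""
    return [mean(example) for example in zip(*scores_by_judge)]

def paired_t(xs, ys):
    """Paired t statistic for two systems scored on the same examples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    # Assumes the differences are not all identical (stdev > 0).
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Toy scores: two judges, two examples.
averaged = per_example_means([[4, 5], [2, 5]])

# Toy per-example averages for two systems on four examples.
system_a = [4, 5, 3, 4]
system_b = [3, 3, 2, 4]
t_stat = paired_t(system_a, system_b)
print(averaged, t_stat)
```

Pairing by example matters here because both systems are judged on the same 20 inputs, so per-example difficulty cancels out in the differences.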

5.4.4  Multilingual  Experiments 

Lastly, we describe our experiments on the multilingual Geoquery data
set. Table 5.4 presents the automatic evaluation results for Wasp⁻¹++ in four target
NLs, namely English, Spanish, Japanese and Turkish, compared to Pharaoh++.
Figure 5.10 shows the Bleu and Nist learning curves for Wasp⁻¹++ (based on
a single run of 10-fold cross validation). Similar to previous results on the larger
Geoquery data set, Wasp⁻¹++ outperformed Pharaoh++ for some language-metric
pairs (p < 0.05), and otherwise performed comparably. Also consistent with
previous results for semantic parsing (Sections 3.3.3 and 4.3.2), the performance
of the NLG systems was the lowest for Turkish, an agglutinative language with a
relatively large vocabulary. The Nist scores for Japanese were also relatively low,
although the Bleu scores were disproportionately high. A possible reason is that
function morphemes, which are made separate tokens in the Japanese corpus, are
given too much weight in the Bleu score.

              Pharaoh++             Wasp⁻¹++
              Bleu      Nist        Bleu      Nist
English       0.5344    5.3289      0.6035    5.7133
Spanish       0.6042    5.6321      0.6175    5.7293
Japanese      0.6171    4.5357      0.6585    4.6648
Turkish       0.4562    4.2220      0.4824    4.3283

Table 5.4: Performance of Wasp⁻¹++ on the multilingual Geoquery data set

5.5  Chapter  Summary 

In this chapter, we formulated the problem of tactical generation as a language
translation task, where formal MRs are translated into NL sentences using
statistical MT. We presented results on using a recent statistical MT system called
Pharaoh for tactical generation. We also showed that the Wasp semantic parsing
algorithm can be inverted to produce a tactical generation system called Wasp⁻¹.
This approach allows the same learned grammar to be used for both parsing and
generation. Also it allows the chart parser in Wasp to be used for generation with
minimal modifications. While reasonably effective, both Pharaoh and Wasp⁻¹
can be substantially improved by borrowing ideas from each other. Hence we
presented two hybrid systems, Pharaoh++ and Wasp⁻¹++. All four systems
require source MRLs to be variable-free. We outlined a series of experiments that
demonstrate the effectiveness of these tactical generation systems, based on
automatic evaluation metrics and human assessment. The SCFG-based hybrid system
Wasp⁻¹++, produced by inverting Wasp and incorporating Pharaoh's probabilistic
model, was shown to achieve the best overall results across different languages
and application domains.

Figure 5.10: Learning curves for Wasp⁻¹++ on multilingual Geoquery data
Chapter 6


Natural  Language  Generation  with  Logical  Forms 


This chapter completes the last piece of the Wasp puzzle, introducing a tactical
generation algorithm that accepts input logical forms. The tactical generation
algorithm, λ-Wasp⁻¹++, is a straightforward extension of Wasp⁻¹++, in which
the underlying grammar is a λ-SCFG. This allows the same learned λ-SCFG to be
used for both parsing and generation.

6.1  Motivation 

As  mentioned  in  Chapter  4,  linguists  have  traditionally  used  predicate  logic 
to  represent  meanings  associated  with  NL  expressions.  Most  existing  NLG  systems 
are  based  on  predicate  logic  (White,  2004;  Carroll  and  Oepen,  2005;  Nakanishi 
et  al.,  2005).  A  prominent  feature  of  predicate  logic  is  its  use  of  logical  variables 
to  denote  entities.  In  Chapter  4,  we  showed  how  logical  variables  can  be  generated 
using  a  synchronous  grammar,  and  how  such  a  grammar  can  be  learned  from  an 
annotated  corpus  for  semantic  parsing.  An  interesting  problem  would  be  to  use  the 
same  learned  grammar  for  NLG  as  well. 

On the other hand, most, if not all, existing NLG systems that can handle
input logical forms involve substantial human-engineered components that are
difficult to maintain. For example, White (2004) describes a hybrid symbolic-statistical
realizer under the OpenCCG framework, in which CCG grammars are
hand-written. Carroll and Oepen (2005) describe a similar system using the English
Resource Grammar (Copestake and Flickinger, 2000). Other NLG systems
that are machine-learned typically require input representations that contain extensive
lexical information of the target NL (Langkilde-Geary, 2002; Corston-Oliver
et al., 2002; Nakanishi et al., 2005; Soricut and Marcu, 2006).

In this chapter, we show that the Wasp⁻¹++ generation algorithm (Section
5.3.2) can be readily extended to support input logical forms.¹ The resulting
algorithm, which we call λ-Wasp⁻¹++, uses the same grammar as the λ-Wasp semantic
parser (Section 4.2). It automatically learns all of its linguistic knowledge
from an annotated corpus consisting of NL sentences coupled with their correct
logical forms. Moreover, it does not require any lexical information in the input
representations, so lexical selection is an integral part of the decoding algorithm.

6.2  The λ-Wasp⁻¹++ Algorithm

This section describes the λ-Wasp⁻¹++ generation algorithm. We first give
an overview of the chart generation algorithm (Section 6.2.1). Then we discuss
k-best decoding for λ-Wasp⁻¹++ (Section 6.2.2), which is needed for minimum
error-rate training of the probabilistic model.

6.2.1  Overview 

The λ-Wasp⁻¹++ generation algorithm is a straightforward extension of
Wasp⁻¹++. Recall that in Wasp⁻¹++, the problem of tactical generation is seen as
translating formal MRs into NL sentences using an SCFG. Wasp⁻¹++ uses a log-linear
model for parse disambiguation (Equation 5.5). An Earley chart generator
that scans the input MR string from left to right is used for decoding, which is possible
because it is assumed that the order in which MR symbols appear determines the
meaning of the MR string.

¹ Although the Wasp⁻¹ algorithm (Section 5.2.2) can be modified in a similar way, we only
consider Wasp⁻¹++ because of its better performance.

In order to support input logical forms, we simply replace the underlying
SCFG of Wasp⁻¹++ with a λ-SCFG (Section 4.2.1). The probabilistic
model for generation remains unchanged. We call the resulting generation algorithm
λ-Wasp⁻¹++. To learn a λ-Wasp⁻¹++ generator, we use the lexical acquisition
algorithm described in Sections 4.2.2 and 4.2.4, and the parameter estimation
algorithm described in Section 5.3.2.

However, there is a major difference between λ-WASP⁻¹++ and WASP⁻¹++.
While in WASP⁻¹++ it can be safely assumed that the order in which MR symbols
appear is significant, this assumption no longer holds in λ-WASP⁻¹++. As mentioned
in Section 4.2.4, certain logical operators such as conjunction (,) are associative
and commutative. Hence, conjuncts can be reordered and regrouped without changing
the meaning of a conjunction. In other words, the relative order of conjuncts
in a conjunction is irrelevant. For example, given the following two input logical
forms:


    answer(x₁,(river(x₁),loc(x₁,x₂),
               equal(x₂,stateid(Colorado))))

    answer(x₁,(river(x₁),equal(x₂,stateid(Colorado)),
               loc(x₁,x₂)))

    (Name all the rivers in Colorado.)

the generated NL sentences should be identical, even though the relative order of
conjuncts loc(x₁,x₂) and equal(x₂,stateid(Colorado)) is different.
This requires a different chart generator than the one used in WASP⁻¹++.
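The equivalence of the two logical forms above can be checked mechanically by flattening every conjunction into an unordered collection. The following is a minimal sketch under stated assumptions: the nested-tuple encoding of logical forms and the helper name `canonical` are introduced here for illustration only and are not part of λ-WASP⁻¹++.

```python
# Assumed encoding (for illustration only): ("and", a, b, ...) is a conjunction,
# ("pred", arg1, ...) is a formula or term, and plain strings are atoms.

def canonical(form):
    """Return a flattened, order-insensitive version of a logical form."""
    if isinstance(form, str):
        return form
    head, *args = form
    args = [canonical(a) for a in args]
    if head == "and":
        flat = []
        for a in args:
            # Associativity: splice a nested conjunction into its parent.
            if isinstance(a, tuple) and a[0] == "and":
                flat.extend(a[1])
            else:
                flat.append(a)
        # Commutativity: a frozenset ignores the order of the conjuncts.
        return ("and", frozenset(flat))
    return (head, *args)


river = ("river", "x1")
loc = ("loc", "x1", "x2")
equal = ("equal", "x2", ("stateid", "Colorado"))

lf1 = ("answer", "x1", ("and", river, loc, equal))
lf2 = ("answer", "x1", ("and", river, equal, loc))
lf3 = ("answer", "x1", ("and", river, ("and", equal, loc)))  # regrouped

print(lf1 == lf2)                        # False: the raw strings differ
print(canonical(lf1) == canonical(lf2))  # True: same meaning
print(canonical(lf1) == canonical(lf3))  # True: regrouping is harmless too
```

A generator that scans the MR string from left to right treats `lf1` and `lf2` as different inputs, which is why the decoder described next indexes chart items by coverage rather than by string position.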
In this section, we describe a decoding algorithm that can handle input logical
forms. As we will see in Section 6.2.2, this decoding algorithm is also used in
minimum error-rate training of λ-WASP⁻¹++. Suppose that we have a λ-SCFG,
G, which consists of the following rules:

    r₁: QUERY → ⟨ Name all the FORM₁ , answer(x₁,(FORM₁(x₁))) ⟩

    r₂: FORM → ⟨ rivers in FORM₁ , λx₁.river(x₁),loc(x₁,x₂),FORM₁(x₂) ⟩

    r₃: FORM → ⟨ STATE₁ , λx₁.equal(x₁,STATE₁) ⟩

    r₄: STATE → ⟨ STATENAME₁ , stateid(STATENAME₁) ⟩

    r₅: STATENAME → ⟨ Colorado , Colorado ⟩

Given  the  following  input  logical  form: 

    answer(x₁,(river(x₁),equal(x₂,stateid(Colorado)),
               loc(x₁,x₂)))

the decoding task is to find a derivation under G that is consistent with this logical
form. Such a derivation exists, but only if we consider partial derivations that
cover disjoint sets of input symbols. For example, the rule r₂ matches river(x₁)
and loc(x₁,x₂) in the logical form, but these two formulas are separated by another
formula equal(x₂,stateid(Colorado)). Since a partial derivation
may cover a disjoint set of input MR symbols, a chart item takes the form of a coverage
vector with a bit for each formula (or term) in the input logical form showing
whether the formula (or term) is covered by the chart item. The set of formulas
(and terms) in a logical form can be found using the MRL grammar. For example,
Figure 6.1 shows a parse tree for the logical form shown above. In this parse tree,
each production corresponds to a formula (e.g. river(x₁)) or a term that is not a
logical variable (e.g. Colorado). The relative order of these formulas and terms
is shown in bracketed indices, [1]-[6]. This ordering corresponds to the order of
a top-down, left-most derivation.

    QUERY[1]: answer(x₁,( FORM[2] ))
        FORM[2]: river(x₁), FORM[3]
            FORM[3]: equal(x₂, STATE[4] ), FORM[6]
                STATE[4]: stateid( STATENAME[5] )
                    STATENAME[5]: Colorado
                FORM[6]: loc(x₁,x₂)

Figure 6.1: A parse tree for the sample Prolog logical form

Each chart item for the sample logical form thus
contains a coverage vector of 6 bits, a bit for each production in the parse tree. We
use [i,j,...] to denote a bit vector in which bits i,j,... are set. The decoding
algorithm starts with the creation of a set of initial chart items, which involves the
computation of coverage vectors for each rule in G:

    (r₁, [1], {x₁/x₁})

    (r₂, [2,6], {x₁/x₁, x₂/x₂})

    (r₃, [3], {x₁/x₂})

    (r₄, [4], {})

    (r₅, [5], {})

Each chart item also contains a substitution, {x₁/xᵢ₁, ..., xₖ/xᵢₖ}, that shows the
renaming of logical variables necessary to transform the MR string of the rule into
the part of the input logical form that the chart item covers. For example, for the
rule r₃, the substitution is {x₁/x₂}, because the logical variable x₁ in the MR string
of r₃ corresponds to x₂ in the input logical form. Note that each rule in G can
give rise to multiple distinct chart items (or none at all). A chart item is said to be
inactive if all RHS non-terminals in the rule have been rewritten. Otherwise, a chart
item is said to be active. For example, all chart items shown above are active except
the last one, as there are no RHS non-terminals in r₅.
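A coverage vector maps naturally onto an integer bitmask. The sketch below encodes the five initial chart items listed above; the `Item` class, its field names, and the helper `bits` are assumptions made for this illustration, not the data structures of the actual implementation.

```python
from dataclasses import dataclass


def bits(*positions):
    """Build a coverage vector with the given 1-based bits set."""
    mask = 0
    for p in positions:
        mask |= 1 << (p - 1)
    return mask


@dataclass(frozen=True)
class Item:
    rule: str        # rule name, e.g. "r2"
    coverage: int    # bitmask over the productions [1]..[6] of Figure 6.1
    subst: tuple     # (rule variable, input variable) pairs
    open_slots: int  # RHS non-terminals that have not been rewritten yet

    @property
    def active(self):
        return self.open_slots > 0

    @property
    def size(self):
        # Item size = number of true bits in the coverage vector.
        return bin(self.coverage).count("1")


initial_items = [
    Item("r1", bits(1), (("x1", "x1"),), 1),
    Item("r2", bits(2, 6), (("x1", "x1"), ("x2", "x2")), 1),
    Item("r3", bits(3), (("x1", "x2"),), 1),
    Item("r4", bits(4), (), 1),
    Item("r5", bits(5), (), 0),
]

for item in initial_items:
    state = "active" if item.active else "inactive"
    print(item.rule, format(item.coverage, "06b"), state)
```

With this encoding, the item for r₂ has coverage `100010` (bits 2 and 6, reading the binary string from the right), and r₅ is the only inactive item, matching the text.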

Decoding proceeds by repeatedly combining chart items. An active item,
(rₐ, vₐ, σₐ), may combine with an inactive item, (rᵢ, vᵢ, σᵢ), if all of the following
conditions are met:

1. The inactive item completes the active item.

2. The coverage vectors vₐ and vᵢ are disjoint.

3. The substitution σᵢ is compatible with σₐ.

To illustrate these conditions, consider the inactive item (r₅, [5], {}). It can combine
with the active item (r₄, [4], {}), because [5] occupies the argument position of [4]
(Condition 1), and [4] and [5] are disjoint (Condition 2). Condition 3 is also met
because σᵢ is empty. The combination of these two items results in a new item,
(r₄, [4-5], {}), where [4-5] is the union of [4] and [5], and {} is the union of σᵢ
and σₐ. This new item is inactive because all RHS non-terminals in r₄ have been
rewritten.

This new item can then combine with the active item (r₃, [3], {x₁/x₂}), because
[4-5] occupies the argument position of [3] (Condition 1), [3] and [4-5] are
disjoint (Condition 2), and σᵢ is empty (Condition 3). This results in a new inactive
item, (r₃, [3-5], {x₁/x₂}), where {x₁/x₂} is the union of σᵢ and σₐ.

This new item can then combine with (r₂, [2,6], {x₁/x₁, x₂/x₂}): Condition
1 is met because [2,6] and [3-5] together form a logical conjunction. Condition 2
is met because [2,6] and [3-5] are disjoint. For Condition 3, note that the MR
string of r₃, which is a λ-function, is used to rewrite the FORM non-terminal in
the MR string of r₂. Upon function application, all bound occurrences of x₁ in the
λ-function would be renamed to x₂, and therefore, occurrences of x₁ in σᵢ should
be renamed to x₂ as well. This results in a new substitution σ′ᵢ = {x₂/x₂}, which is
compatible with σₐ = {x₁/x₁, x₂/x₂} because there is no xⱼ such that xⱼ/xⱼ′ ∈ σ′ᵢ,
xⱼ/xⱼ″ ∈ σₐ, and xⱼ′ ≠ xⱼ″. The combination of these two items thus gives rise to
a new inactive item, (r₂, [2-6], {x₁/x₁, x₂/x₂}), where {x₁/x₁, x₂/x₂} is the union
of σ′ᵢ and σₐ.

Lastly, this new item combines with (r₁, [1], {x₁/x₁}). The resulting item
is (r₁, [1-6], {x₁/x₁}).² Since all 6 bits of the coverage vector are set, this item is
a goal item, which corresponds to a complete derivation of the input logical form.
The NL string that this derivation yields is then a translation of this logical form.
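Conditions 2 and 3 can be sketched in a few lines. The code below is a hedged illustration, not the thesis implementation: coverage vectors are assumed to be integer bitmasks (bit p−1 for production [p]), substitutions are assumed to be dictionaries from rule variables to input variables, and the function names are hypothetical. Condition 1 depends on the grammar and is assumed to be checked by the caller; the bookkeeping for variables that become invisible is also omitted.

```python
def disjoint(v_a, v_i):
    """Condition 2: no production is covered by both items."""
    return v_a & v_i == 0


def compatible(s_a, s_i):
    """Condition 3: no rule variable maps to two different input variables."""
    return all(s_a[x] == s_i[x] for x in s_a.keys() & s_i.keys())


def rename(subst, old, new):
    """Rename a rule variable, as happens upon function application."""
    return {(new if x == old else x): y for x, y in subst.items()}


def combine(v_a, s_a, v_i, s_i):
    """Merge coverage vectors and substitutions, or return None on a clash."""
    if not disjoint(v_a, v_i) or not compatible(s_a, s_i):
        return None
    return v_a | v_i, {**s_a, **s_i}


# Worked example from the text: the r3 item [3-5] combines with the r2 item [2,6].
v_r2, s_r2 = 0b100010, {"x1": "x1", "x2": "x2"}
v_r3, s_r3 = 0b011100, {"x1": "x2"}

s_r3_renamed = rename(s_r3, "x1", "x2")          # {"x2": "x2"}
print(combine(v_r2, s_r2, v_r3, s_r3_renamed))   # (62, {'x1': 'x1', 'x2': 'x2'})
print(combine(v_r2, s_r2, v_r3, s_r3))           # None: x1 would map to x1 and x2
```

The second call shows why the renaming step matters: without it, the two substitutions disagree on x₁ and the combination is rejected.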

Figure 6.2 shows the basic decoding algorithm of λ-WASP⁻¹++. Inactive
items are examined in ascending order of item size (i.e. number of true bits in
the coverage vector). COMBINE-ITEMS(c, c′) returns the item resulting from the
combination of c and c′. It returns null if c and c′ cannot combine. Each item
is associated with a probability as defined by the log-linear model (Equation 5.5).
UPDATE-CHART(C, c″) adds c″ to C if c″ is not already in C, or replaces the item
in C with c″ if c″ has a higher probability. The output of this decoding algorithm is
the most probable derivation consistent with the input logical form.

2 The substitution {x₁/x₁} does not include any mapping from x₂ because x₂ is a free variable
in r₂ and is no longer visible outside r₂. Following Kay (1996), we keep track of all logical
variables that have become invisible (e.g. x₂). A partial derivation is filtered out if any of these
logical variables is accidentally bound.

Input: a logical form, f, a λ-SCFG, G, and an unambiguous MRL grammar, G′.

DECODE-λ-WASP⁻¹++(f, G, G′)
1  f′ ← linearized parse of f under G′
2  C ← set of initial chart items based on f′ and G
3  for i ← 1 to |f′| − 1
4      do for each inactive item c ∈ C of size i
5             do for each active item c′ ∈ C
6                    do c″ ← COMBINE-ITEMS(c, c′)
7                       if c″ is not null
8                          then UPDATE-CHART(C, c″)
9  return c ∈ C of size |f′| with the highest probability

Figure 6.2: The basic decoding algorithm of λ-WASP⁻¹++

This algorithm can take exponential time, since there can be 2^|f′| distinct coverage
vectors for a given logical form, f. This seems reasonable because most other generation
algorithms that accept input logical forms operate in exponential time as well (Moore,
2002; White, 2004; Carroll and Oepen, 2005). Moreover, generation can be sped up
considerably by pruning away low-probability inactive items before each iteration
of the outer for loop (i.e. before line 4). In our experiments, we retain only the top
100 × |f′| inactive items for each iteration.
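The two bookkeeping steps described above, keeping only the most probable derivation per chart item and pruning low-probability inactive items, can be sketched as follows. This is an illustration under assumptions: the chart is assumed to be a dictionary keyed by an arbitrary hashable item description, scores are assumed to be log-probabilities, and the names `update_chart` and `prune` are hypothetical.

```python
import heapq


def update_chart(chart, key, log_prob, derivation):
    """Add an item, or replace the stored one if the new one is more probable."""
    best = chart.get(key)
    if best is None or log_prob > best[0]:
        chart[key] = (log_prob, derivation)


def prune(inactive, beam):
    """Keep the `beam` most probable items, most probable first."""
    return heapq.nlargest(beam, inactive, key=lambda entry: entry[0])


chart = {}
key = ("r4", 0b011000)          # rule name and coverage vector [4-5]
update_chart(chart, key, -1.2, "derivation A")
update_chart(chart, key, -0.7, "derivation B")   # more probable: replaces A
update_chart(chart, key, -2.0, "derivation C")   # less probable: ignored
print(chart[key])               # (-0.7, 'derivation B')

candidates = [(-0.5, "a"), (-3.0, "b"), (-1.0, "c")]
print(prune(candidates, 2))     # [(-0.5, 'a'), (-1.0, 'c')]
```

For the six-production sample logical form, the pruning threshold used in the experiments would correspond to `beam = 100 * 6`.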

6.2.2  k-Best  Decoding 

In λ-WASP⁻¹++, parameters of the probabilistic model are trained using
minimum error-rate training, such that the BLEU score of the training set is directly
maximized. Computation of BLEU requires actual generator output, and therefore,
it involves decoding. Moreover, optimization of the BLEU score requires the computation
of BLEU for multiple parameter settings. Och (2003) presents an efficient
method for optimizing BLEU using log-linear models. The basic idea is to approximate
the BLEU score by performing k-best decoding for only a handful of parameter
settings.

In the previous section, we presented a 1-best decoding algorithm for
λ-WASP⁻¹++. A naive implementation of k-best decoding would compute the k-best
derivations for every chart item. However, this can be prohibitively slow given
that it already takes exponential time when k = 1. In this section, we describe
an efficient k-best decoding algorithm for λ-WASP⁻¹++. Originally developed by
Huang and Chiang (2005), this algorithm finds 100-best derivation lists almost as
fast as 1-best decoding.

To see why the naive implementation of k-best decoding is slow, consider
the case where two chart items, c and c′, combine to form a new chart item, c″.
Finding the k-best derivations for c″ involves the following steps:

1. Enumerate k² derivations for c″, based on the k-best derivations for c and c′.

2. Sort these k² derivations.

3. Select the first k derivations from the sorted list of k² derivations.

This increases the time complexity of the decoder by a factor of O(k² log k). However,
since we are only interested in the top k derivations for c″, the first two steps
can be eliminated if we assume that:

1. The k-best lists for c and c′ are sorted.

2. The function that computes the probability of a derivation is monotonic in
each of its sub-derivations.³

3 The use of a language model makes this function only approximately monotonic, e.g. certain
combinations of common phrases can be highly unlikely. In this case, the k-best decoding algorithm
is only approximate.

[Figure 6.3: grids of candidate combinations of c[i] and c′[j]; both axes are labeled 1, 4, 7, 10]

Let c[i] be the i-th element in the k-best list for c. Given the assumptions above, it
is clear that c″[1] is the combination of c[1] and c′[1]. Furthermore, c″[2] is either
the combination of c[1] and c′[2], or the combination of c[2] and c′[1]. In general,
if we view all possible combinations as a grid of cells (see Figure 6.3, where the
numbers are negative log-probabilities), then the next cell to enumerate must be
adjacent to the previously enumerated cells, i.e. it must be one of the cells shaded
gray. Therefore, we need only consider O(k) cells, and can safely ignore the rest of
the grid.

From Figure 6.3, it is evident that to compute the k-best list for c″, we do
not need the full k-best lists for c and c′. In general, since we are only interested in
the k-best list for the goal items, we do not need the full k-best list for every item
in the chart. As we go further down the derivation forest, the number of derivations
required for each item becomes less and less. Therefore, we can speed up the k-best
decoding algorithm considerably by computing k-best lists only when necessary.
Details of the lazy computation of k-best lists can be found in Huang and Chiang
(2005).
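The grid enumeration described above can be sketched with a priority queue over a frontier of cells. This is a simplified illustration of the idea, not the full lazy algorithm of Huang and Chiang (2005): it merges only two sorted lists, assumes that scores are negative log-probabilities combined by plain addition (which is monotonic), and uses 0-based indices where the text uses 1-based ones. The function name is hypothetical.

```python
import heapq


def k_best_combinations(costs_c, costs_c2, k):
    """Lazily enumerate the k cheapest sums costs_c[i] + costs_c2[j].

    Both lists must be sorted in ascending order.
    """
    if not costs_c or not costs_c2:
        return []
    frontier = [(costs_c[0] + costs_c2[0], 0, 0)]  # the best cell is (0, 0)
    seen = {(0, 0)}
    result = []
    while frontier and len(result) < k:
        cost, i, j = heapq.heappop(frontier)
        result.append((cost, i, j))
        # Only cells adjacent to an enumerated cell can be the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(costs_c) and nj < len(costs_c2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (costs_c[ni] + costs_c2[nj], ni, nj))
    return result


axis = [1, 4, 7, 10]  # the axis values that survive from Figure 6.3
print([cost for cost, _, _ in k_best_combinations(axis, axis, 4)])  # [2, 5, 5, 8]
```

Each pop adds at most two new cells to the frontier, so only O(k) cells are ever scored instead of the full k² grid.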
6.3 Experiments

In this section, we present experimental results that demonstrate the effectiveness
of the λ-WASP⁻¹++ generation algorithm.

6.3.1  Data  Sets  and  Methodology 

We evaluated λ-WASP⁻¹++ in the GEOQUERY domain. In the experiments,
we used the same GEOQUERY data set used to evaluate λ-WASP (Section 4.3.1).
Specifically, the original Prolog logical forms were used. Table 4.1 shows the corpus
statistics.

We only performed automatic evaluation, based on 4 runs of standard 10-fold
cross validation, using the BLEU and NIST scores as the evaluation metrics. We
did not perform human evaluation, since our human evaluation results in Section
5.4.3 indicate that the BLEU and NIST scores correlate well with human judgments
in evaluating NLG systems in this domain.

6.3.2  Results  and  Discussion 

Table 6.1 shows the performance of λ-WASP⁻¹++ on the GEOQUERY 880
data set with full training data, compared to two other NLG systems:

•  PHARAOH++ (Section 5.3.1), which uses statistical phrase-based MT.

•  WASP⁻¹++ (Section 5.3.2), the inverse of the WASP semantic parser, with
   PHARAOH's probabilistic model.

Unlike λ-WASP⁻¹++, both PHARAOH++ and WASP⁻¹++ take functional queries
(written in FUNQL) as input. The best-performing systems based on paired t-tests
are shown in bold (p < 0.05). Figure 6.4 shows the learning curves (based on a
single run of 10-fold cross validation).

                 GEOQUERY (880 data set)
                 BLEU      NIST
    λ-WASP⁻¹++   0.5320    6.4668
    PHARAOH++    0.5354    6.3637
    WASP⁻¹++     0.5370    6.4808

Table 6.1: Performance of λ-WASP⁻¹++ on the GEOQUERY 880 data set

                 GEOQUERY (880 data set)
    λ-WASP⁻¹++   2.9 s
    PHARAOH++    0.03 s
    WASP⁻¹++     0.7 s

Table 6.2: Average time needed for generating one test sentence

Table 6.1 shows that the performance of λ-WASP⁻¹++ is comparable to that
of PHARAOH++ and WASP⁻¹++, despite markedly different input representations.
Pruning also kept the running time to a reasonable level (Table 6.2), although the
decoding algorithm could take exponential time.

Figure 6.4 shows that λ-WASP⁻¹++ outperformed WASP⁻¹++ with limited
training data. This is because the lexical acquisition algorithm of λ-WASP⁻¹++
(i.e. that of λ-WASP) produces rules that generalize better (Section 4.2.4). Hence
coverage is significantly higher for λ-WASP⁻¹++, especially when the training set
is small (Figure 6.5), leading to steeper learning curves in terms of the BLEU and
NIST scores. However, WASP⁻¹++ quickly caught up in terms of coverage as more
training data became available. This is unlike the parsing case where WASP failed to
keep up with λ-WASP in terms of recall (Figure 4.11). This indicates that tactical
generation is an easier task than semantic parsing. While for tactical generation
it suffices to learn one mapping for each MR predicate to get complete coverage,
for semantic parsing one needs to learn a mapping for each NL phrase to achieve
perfect recall, which is much more difficult because of synonymy.

                PHARAOH++          WASP⁻¹++           λ-WASP⁻¹++
                BLEU     NIST      BLEU     NIST      BLEU     NIST
    English     0.5344   5.3289    0.6035   5.7133    0.6121   5.8254
    Spanish     0.6042   5.6321    0.6175   5.7293    0.6584   5.9390
    Japanese    0.6171   4.5357    0.6585   4.6648    0.6857   4.8330
    Turkish     0.4562   4.2220    0.4824   4.3283    0.4737   4.3553

Table 6.3: Performance of λ-WASP⁻¹++ on multilingual GEOQUERY data

In addition, λ-WASP⁻¹++ outperformed PHARAOH++ when the training set
was small. This indicates that despite its lower coverage compared to PHARAOH++,
λ-WASP⁻¹++ produced NL translations that were consistently more accurate.

Table 6.3 and Figure 6.6 show the performance of λ-WASP⁻¹++ on the multilingual
GEOQUERY data set. Similar to previous results on the larger GEOQUERY
data set, λ-WASP⁻¹++ outperformed PHARAOH++, and performed comparably to
WASP⁻¹++. Also consistent with previous observations (Section 5.4.4), the performance
of λ-WASP⁻¹++ is the lowest for Turkish, followed by Japanese. For
English and Spanish, the performance is comparable.

6.4  Chapter  Summary 

In this chapter, we described a tactical generation algorithm that translates
logical forms into NL sentences. This algorithm is called λ-WASP⁻¹++. It can be
seen as the inverse of the λ-WASP semantic parser, since both algorithms are based
on the same underlying λ-SCFG grammar. It also shares the same log-linear probabilistic
model with WASP⁻¹++, which is maximum-BLEU trained. We described a
chart generation algorithm that can handle input logical forms, and a fast k-best
decoding algorithm for efficient maximum-BLEU training. Experiments showed that
λ-WASP⁻¹++ is competitive compared to other MT-based generators, especially
when training data is scarce.

Figure 6.4: Learning curves for λ-WASP⁻¹++ on the GEOQUERY 880 data set

Figure 6.6: Learning curves for λ-WASP⁻¹++ on multilingual GEOQUERY data

Chapter 7


Future Work


In  this  chapter,  we  discuss  some  future  directions  for  the  research  presented 
in  this  thesis. 

7.1  Interlingual  Machine  Translation 

As mentioned in Chapter 1, an application of semantic parsers and tactical
generators is interlingual MT. In interlingual MT, source texts are first converted
into a formal MRL that is language-independent, called an interlingua. From such
interlingual representations, target texts are then generated. A possible interlingua
in the GEOQUERY domain, for example, would be Prolog logical forms. An advantage
of interlingual MT over direct MT is economy of effort in a multilingual
environment: while direct MT requires a separate system for each language pair,
interlingual MT only requires a parser and a generator for each language. Moreover,
for structurally dissimilar language pairs such as Turkish and English, interlingual
MT can achieve good results with a simpler system design (Hakkani et al., 1998).
Early knowledge-based, interlingual MT systems are effective in restricted domains
with limited vocabulary (Nyberg and Mitamura, 1992). It would be interesting to
see how statistical interlingual MT systems compare against state-of-the-art direct
MT systems (e.g. PHARAOH) in restricted domains such as GEOQUERY.
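The economy-of-effort argument can be made concrete with a small count. The sketch below assumes that a direct MT system is needed for every translation direction (every ordered pair of languages); if a single system served both directions of a pair, the direct count would be halved. The function names are hypothetical.

```python
def direct_systems(n_languages: int) -> int:
    """One direct MT system per ordered (source, target) language pair."""
    return n_languages * (n_languages - 1)


def interlingual_components(n_languages: int) -> int:
    """One parser and one generator per language."""
    return 2 * n_languages


# The four languages of the multilingual GEOQUERY data set.
languages = ["English", "Spanish", "Japanese", "Turkish"]
n = len(languages)
print(direct_systems(n), interlingual_components(n))  # 12 8
```

The gap widens quickly: the direct count grows quadratically with the number of languages, while the interlingual count grows linearly.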

                        λ-WASP/λ-WASP⁻¹++               PHARAOH
                        Cov. (%)  BLEU     NIST     Cov. (%)  BLEU     NIST
    Spanish-English     90.4      0.5415   5.1790   100.0     0.7496   6.6862
    Japanese-English    90.0      0.5255   5.2691   100.0     0.5700   5.6039
    Turkish-English     77.6      0.4431   3.8735   100.0     0.6490   6.1504

Table 7.1: Performance of MT systems on multilingual GEOQUERY data

                        λ-WASP/λ-WASP⁻¹++     PHARAOH
                        BLEU     NIST         BLEU     NIST
    Spanish-English     0.6215   5.8076       0.7836   6.6443
    Japanese-English    0.5930   5.6748       0.6149   5.7997
    Turkish-English     0.6218   5.6258       0.7503   6.4653

Table 7.2: MT performance considering only examples covered by both systems

We evaluated a simple statistical interlingual MT system composed of λ-WASP
(Section 4.2) and λ-WASP⁻¹++ (Section 6.2). In this MT system, source
sentences are converted into Prolog logical forms using λ-WASP. Then, the Prolog
logical forms are translated into the target language using λ-WASP⁻¹++. For each
source sentence, only the best Prolog logical form is used. If the source sentence
cannot be converted into a complete Prolog logical form, then no output sentence
will be generated.
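The control flow of this two-stage system can be sketched as a simple composition. The code below is a hedged illustration: `parse` and `generate` are hypothetical callables standing in for a trained λ-WASP parser and a trained λ-WASP⁻¹++ generator, and the toy lookup tables exist only to make the example runnable.

```python
from typing import Callable, List, Optional


def interlingual_translate(
    sentence: str,
    parse: Callable[[str], Optional[str]],
    generate: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Parse the source into a logical form, then generate the target."""
    logical_form = parse(sentence)  # best complete logical form, or None
    if logical_form is None:
        return None                 # uncovered sentence: no output
    return generate(logical_form)


def coverage(outputs: List[Optional[str]]) -> float:
    """Percentage of test sentences for which some output was produced."""
    return 100.0 * sum(o is not None for o in outputs) / len(outputs)


# Toy stand-ins for the trained components (assumptions for illustration).
toy_parser = {"nombra todos los rios en colorado": "LF-1"}.get
toy_generator = {"LF-1": "name all the rivers in colorado"}.get

sources = ["nombra todos los rios en colorado", "una frase desconocida"]
outputs = [interlingual_translate(s, toy_parser, toy_generator) for s in sources]
print(outputs)
print(coverage(outputs))
```

Because a failed parse yields no output at all, the coverage of the whole pipeline can never exceed the coverage of the parser.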

Table 7.1 shows the preliminary results on the multilingual GEOQUERY
data set using 10-fold cross validation, where the best-performing systems based
on paired t-tests are shown in bold. Besides the BLEU and NIST scores, the table
also shows the percentage of test examples covered by the MT systems. By
most measures, PHARAOH outperformed the interlingual MT system (p < 0.05
based on paired t-tests). A primary reason is that λ-WASP often could not analyze
source sentences completely, which led to low coverage. However, even ignoring
sentences that are not covered, the performance of the interlingual MT system is
still low (Table 7.2). There are two contributing factors to this. First, in the
interlingual MT system, the parsing and generation components are loosely coupled, so
errors propagate easily. Second, the interlingua may fail to capture certain stylistic
preferences in the texts.

    Source: ¿Cuantas personas viven en Spokane, Washington?

    Reference: How many people live in Spokane, Washington?

    λ-WASP/λ-WASP⁻¹++: What is the population of Spokane, WA?

Figure 7.1: Output of interlingual MT from Spanish to English in GEOQUERY

Some of these problems could be easily remedied. To improve coverage, we
could add rules to λ-WASP and λ-WASP⁻¹++ that glue partial derivations together
(Chiang, 2005), or add default rules for previously unseen words. To reduce error
propagation, we could have λ-WASP produce multiple analyses of a source sentence
to avoid committing to a particular analysis. λ-WASP⁻¹++ could then be used to
generate the best overall translation, or a translation that covers the most analyses
(Knight and Langkilde, 2000). To make sure that synonymous expressions do not
get penalized (e.g. Figure 7.1), we could elicit more reference translations for each
source sentence (Section 5.4.2), or perform human evaluation (Section 5.4.3).

A more fundamental problem is designing an appropriate interlingua for a
particular domain. An MRL that is adequate for querying databases may not be
adequate for interlingual MT. Moreover, while it is feasible to build an interlingual
MT system for specific domains such as medical triage (Gao et al., 2006), it is
much more difficult for broader domains such as newspaper texts (Knight et al.,
1995; Farwell et al., 2004). This is because to describe all important concepts in the
world requires a comprehensive ontology, but such knowledge resources are very
difficult to obtain. However, we still believe that translation involves understanding,
and interlingual MT is the right approach. As we mentioned in Chapter 1, the use
of concise interlingual representations can improve statistical MT. Likewise, the
ability to understand unrestricted texts will have wide implications in other research
areas such as question answering, information retrieval, document summarization,
and human-computer interaction. In subsequent sections, we will discuss some
possible research avenues that would allow progress toward broad-domain natural
language understanding and generation.

7.2  Shallow  Semantic  Parsing 

Current  research  on  broad-domain  semantic  analysis  has  mainly  focused  on 
the  following  two  sub-tasks:  word  sense  disambiguation  and  semantic  role  labeling. 
Word  sense  disambiguation  (WSD)  is  to  identify  the  correct  meaning  (or  sense)  of  a 
word  in  context  (Lee  and  Ng,  2002).  Semantic  role  labeling  (SRL)  is  to  identify  the 
semantic  arguments  of  a  given  predicate  in  a  sentence  (Gildea  and  Jurafsky,  2002). 
These  two  tasks  are  closely  related.  For  example,  consider  the  following  sentence: 

The  robbers  tied  Peter  to  his  chair. 

To  identify  the  predicate-argument  structure  of  this  sentence,  we  need  to  determine 
the  correct  sense  of  the  word  tied,  which  is  “to  physically  attach”  in  this  case  (as  op¬ 
posed  to  “making  a  mental  connection”).  Once  the  predicate  is  correctly  identified, 
we  can  identify  its  arguments  (shown  in  brackets): 

    [AGENT The robbers] tied [ITEM Peter] [GOAL to his chair].

Each argument takes a specific role. In this case, the robbers are the AGENT that
causes Peter (the ITEM) to be physically attached to his chair (the GOAL). These
roles can be predicate-specific. In other words, the sense of a word (e.g. tied) can
influence the roles associated with it. Conversely, the roles associated with a word
can influence its sense as well (Lapata and Brew, 1999). In this case, the fact that
a chair is a physical GOAL makes it more likely that tied means “to physically
attach”. These word senses and semantic roles are defined in ontologies such as
WordNet (Fellbaum, 1998), FrameNet (Fillmore et al., 2003), and Omega (Philpot
et al., 2005).

Traditionally, WSD and SRL have been treated as two separate tasks: WSD
is done without knowledge of the semantic roles associated with a word, and SRL is
done assuming that the predicate has been correctly identified (Gildea and Jurafsky,
2002), or assuming that the semantic roles are predicate-independent as in PropBank
(Palmer et al., 2005). We argue that WSD and SRL should be more tightly
coupled. The need for joint inference is more evident when we consider more than
one predicate in a sentence:

    [RECIPIENT Mary] got [THEME [REQUIREMENT the ingredients] needed [DEPENDENT
    to make [FOOD ice-cream]]].

In this sentence, the predicates and their arguments form a tree structure. However,
current SRL methods that consider one predicate at a time cannot capture such
interactions among predicates (Carreras and Marquez, 2005; Erk and Pado, 2006).

There  has  been  some  preliminary  work  on  combining  WSD  with  SRL. 
Thompson  et  al.  (2003)  present  a  generative  model  that  performs  joint  WSD  and 
SRL  for  the  main  verb  of  a  sentence.  Erk  (2005)  reports  some  preliminary  results 
on  using  semantic  argument  information  of  a  word  to  improve  WSD. 

In the future, we would like to explore semantic parsing in a broad-domain
setting. Specifically, we would like to combine WSD with SRL in a more tightly-coupled
process. The semantic parsing task is shallow in the sense that many important
linguistic phenomena are ignored, such as quantification and tense. Also,
words can be left unanalyzed if they do not correspond to any defined concepts in
an ontology. Note that in WASP, WSD and SRL are already integrated in the chart
parsing process. We believe that WASP could be adapted to handle unrestricted
texts, by treating nested predicate-argument structures as the target MRL.

The  following  semantically-annotated  corpora  could  be  used  for  broad-domain 
shallow  semantic  parsing.  Baker  et  al.  (2007)  have  recently  released  a  small  English 
corpus  based  on  FrameNet,  in  which  every  sentence  is  annotated  with  the  semantic 
arguments  for  all  predicates.  For  larger  corpora,  the  OntoNotes  project  (Weischedel 
et  al.,  2007)  is  an  ongoing  effort  to  produce  an  extended  version  of  English,  Chinese 
and  Arabic  Propbanks  annotated  with  word  sense  information  for  nouns  and  verbs, 
linked  to  the  Omega  ontology,  and  coreference. 

7.3  Beyond  Context-Free  Grammars 

Another issue related to broad-domain semantic analysis is the prevalence
of long-distance dependencies in unrestricted texts.¹ Long-distance dependencies
occur when semantic arguments are realized outside the maximal phrase headed by
the predicate. Examples include the following:

    [The dog] which they had just bought ran away. (Relative clause)

    [They] are hoping to secure state funding this year. (Subject control)

    [This record] is hard to beat. (Tough-movement)

It is well known that CFGs cannot easily capture long-distance dependencies (Levy
and Manning, 2004). A number of sophisticated grammar formalisms that can
handle such dependencies have been developed, including combinatory categorial
grammars (CCG) and tree-adjoining grammars (TAG). CCGs and TAGs are
also said to be mildly context-sensitive, because they have strictly greater generative
capacity than CFGs, yet remain polynomially parsable (Weir, 1988). Recently,
Clark and Curran (2004) released a highly efficient wide-coverage CCG
parser, which provides an attractive alternative to traditional statistical CFG parsers
(Collins, 1997; Charniak, 2000).

1 Long-distance dependencies are not very common in the restricted domains we have worked
with. For example, λ-WASP outperforms Zettlemoyer and Collins (2007) in the GEOQUERY domain
(Section 4.3.2), although the latter can handle long-distance dependencies.

Existing work on semantic analysis using CCGs and TAGs mostly involves
hand-written components that are language-specific (Shieber and Schabes, 1990b;
Bos, 2005; Zettlemoyer and Collins, 2007). In the future, we would like to devise
learning algorithms similar to WASP that construct synchronous CCGs and TAGs
given training data in any language. Such synchronous grammars can be useful in
natural language generation and machine translation as well (Shieber and Schabes,
1990a; Shieber, 2007). Our goal is to extract synchronous grammars from parallel
corpora with limited or no syntactic annotations. For this, previous work on extracting
CCGs and TAGs from corpora without CCG or TAG annotations would be relevant
(Hockenmaier and Steedman, 2002; Chen et al., 2006).

7.4  Using  Ontologies  in  Semantic  Parsing 

The  research  presented  in  this  thesis  illustrates  the  importance  of  domain 
knowledge  in  semantic  parsing  and  natural  language  generation.  Specifically,  in 
all  of  the  WASP-based  systems,  domain  knowledge  comes  in  the  form  of  an  MRL 
grammar  that  defines  a  set  of  possible  MRs  in  a  particular  domain.  However,  not 
all information can be conveniently encoded in an MRL grammar, and for broad-domain
semantic analysis, knowledge bases such as FrameNet and Omega can be
very  useful.  An  interesting  question  would  be  how  to  effectively  use  the  knowledge 
encoded  in  these  ontologies  in  a  statistical  semantic  parsing  framework. 

On  the  other  hand,  knowledge  gleaned  from  texts  can  also  be  integrated 
with  existing  ontologies,  which  can  be  useful  for  understanding  further  texts  in  the 
same domain (Barker et al., 2007). In other words, natural language understanding
and  knowledge  acquisition  can  form  a  tightly-coupled  cycle,  where  knowledge  is 
accumulated  by  reading  a  given  corpus  of  unannotated  texts.  This  would  allow 
knowledge  acquisition  on  a  truly  large  scale,  and  can  lead  to  automated  systems 
that  learn  natural  languages  like  humans,  using  basic  prior  knowledge  to  bootstrap 
the  learning  process.  To  combine  natural  language  understanding  and  knowledge 
acquisition  in  a  robust  statistical  framework  is  therefore  a  very  interesting  problem, 
which we intend to pursue in the future.

Chapter 8


Conclusions 


In this thesis, we focused on two sub-tasks of natural language understanding
and generation, namely semantic parsing and tactical generation. Semantic
parsing is the task of transforming natural-language sentences into formal symbolic
meaning representations (MR), and tactical generation is the inverse task of
transforming formal MRs into sentences. We presented a number of novel statistical
learning algorithms for semantic parsing and tactical generation. These algorithms
automatically learn all of their linguistic knowledge from annotated corpora, and
can handle sentences that are conceptually complex.

The key idea of this thesis is that since both semantic parsing and tactical
generation are essentially language translation tasks between natural languages
(NL) and formal meaning representation languages (MRL), both can be tackled
using state-of-the-art statistical machine translation (MT) techniques. Specifically,
we introduced a learning algorithm for semantic parsing called WASP (Chapter 3),
based on a technique called synchronous parsing, which has been extensively used
in syntax-based statistical MT. The underlying grammar of WASP is a weighted
synchronous context-free grammar (SCFG) extracted from an automatically word-aligned
parallel corpus consisting of NL sentences and their correct MRs, with the
help of an unambiguous context-free grammar of the target MRL. The weights of
the SCFG define a log-linear distribution over its derivations. The WASP algorithm
is designed to handle variable-free MRLs, as exemplified by CLANG, the
RoboCup coach language (Section 2.1). We empirically evaluated the effectiveness
of WASP in two real-world domains, Geoquery and RoboCup, and in four
different NLs, namely English, Spanish, Japanese and Turkish. Experimental results
showed that the performance of WASP is competitive with the best current
methods requiring similar supervision.
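
As a rough illustration of how a single synchronous derivation yields an NL string and an MR at the same time, the sketch below expands a derivation over four toy SCFG rules. The rules, the Rule class and the expand function are assumptions made for this example only; they are not taken from a learned WASP grammar, weights are omitted, and the sketch assumes that non-terminals appear in the same order on both sides of every rule.

```python
from dataclasses import dataclass

NONTERMINALS = {"QUERY", "CITY", "STATE"}


@dataclass(frozen=True)
class Rule:
    lhs: str
    nl: tuple  # NL-side pattern (words and non-terminals)
    mr: tuple  # MR-side pattern (MR tokens and non-terminals)


# Toy rules, hand-written for illustration only.
R_QUERY = Rule("QUERY", ("what", "is", "CITY"), ("answer(", "CITY", ")"))
R_CAPITAL = Rule("CITY", ("the", "capital", "CITY"), ("capital(", "CITY", ")"))
R_LOC = Rule("CITY", ("of", "STATE"), ("loc_2(", "STATE", ")"))
R_TEXAS = Rule("STATE", ("texas",), ("stateid(texas)",))


def expand(derivation):
    """Return the (NL string, MR string) pair yielded by a derivation.

    A derivation is a pair (rule, children), where children are the
    derivations that rewrite the rule's non-terminals from left to right.
    """
    rule, children = derivation
    yields = [expand(child) for child in children]
    nl_parts, mr_parts = iter(yields), iter(yields)
    nl = [next(nl_parts)[0] if tok in NONTERMINALS else tok for tok in rule.nl]
    mr = [next(mr_parts)[1] if tok in NONTERMINALS else tok for tok in rule.mr]
    return " ".join(nl), "".join(mr)


derivation = (R_QUERY, [(R_CAPITAL, [(R_LOC, [(R_TEXAS, [])])])])
sentence, meaning = expand(derivation)
print(sentence)  # what is the capital of texas
print(meaning)   # answer(capital(loc_2(stateid(texas))))
```

Because one derivation determines both strings, a parser can search for derivations whose NL side matches the input sentence, while a generator can search for derivations whose MR side matches the input MR.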

In Chapter 4, we extended the WASP semantic parsing algorithm to handle
MRLs such as predicate logic, on which most existing work on formal semantics
and computational semantics is based. The resulting algorithm, λ-WASP, uses an
extended version of SCFG called λ-SCFG, in which logical forms are generated
using the lambda calculus. We proposed a learning algorithm similar to WASP,
which learns a weighted λ-SCFG from a parallel corpus consisting of NL sentences
paired with their correct logical forms. We further refined the learning algorithm
through transformation of logical forms and language modeling for target MRLs.
Using the same amount of supervision, λ-WASP was shown to significantly outperform
WASP, and is currently one of the best semantic parsing algorithms in the
Geoquery domain.

For tactical generation, we proposed several learning methods for variable-free
MRLs using statistical MT (Chapter 5). We presented results on using a recent
phrase-based statistical MT system called PHARAOH for tactical generation.
We also showed that the WASP semantic parsing algorithm can be inverted to produce
a tactical generation system called WASP⁻¹. This approach allows the same
learned grammar to be used for both parsing and generation. It also allows the chart
parser in WASP to be used for generation with minimal modifications. While reasonably
effective, both PHARAOH and WASP⁻¹ can be substantially improved by
borrowing ideas from each other. The resulting hybrid systems, PHARAOH++ and
WASP⁻¹++, were shown to be much more robust and accurate, based on automatic
and human evaluations in the Geoquery and RoboCup domains. In particular,
the SCFG-based hybrid system WASP⁻¹++, produced by inverting WASP and incorporating
PHARAOH's probabilistic model, was shown to be the best overall among
the four proposed systems.

Lastly, we extended the WASP⁻¹++ tactical generation algorithm to handle
predicate logic (Chapter 6). The resulting algorithm, λ-WASP⁻¹++, shares the same
underlying λ-SCFG grammar with λ-WASP, and the same probabilistic model with
WASP⁻¹++. We presented a chart generation algorithm that can handle input logical
forms. Experiments showed that λ-WASP⁻¹++ is competitive with other
MT-based generators, especially when training data is scarce.

Overall, the research presented in this thesis has made significant contributions
to natural language processing in the following two aspects. First, while the
use of a single grammar for both parsing and generation has long been advocated for
its elegance, and several implementations of this idea already exist (Section
2.3.1), our work is the first attempt to use the same automatically-learned grammar
for both parsing and generation. Our WASP-based parsers and generators acquire all
of their linguistic knowledge from annotated corpora, unlike other existing systems
that require manually-constructed grammars and lexicons (e.g. Carroll and Oepen,
2005). Therefore, our WASP-based systems require much less tedious domain-specific
knowledge engineering, and can be easily ported to other languages and
application domains.

Second, while our MT-based parsers and generators have only been empirically
tested in restricted domains such as Geoquery, our work represents an important
step toward broad-domain natural language understanding and generation.
There is no reason to believe that similar MT-based approaches cannot be used
for understanding and generating unrestricted texts, as statistical MT systems with
massive amounts of training data have already demonstrated the ability to translate
between a wide variety of languages. As argued in Chapter 7, there are three major
challenges to solve: (1) devising a suitable MRL for a broad array of applications,
such as question answering and interlingual MT, (2) acquiring a knowledge repository
that captures all important concepts in the world, and (3) gathering enough
training data for effective statistical learning. Solving these problems will require
major breakthroughs in areas such as knowledge representation and reasoning, machine
learning, natural language processing, and data mining. However, we expect
that statistical MT methods will still be relevant because the basic problem of mapping
NL expressions to concepts will remain the same. We can see plenty there that
needs to be done, but at least we can see the road ahead.

Appendix A


Grammars  for  Meaning  Representation  Languages 


This appendix describes the grammars for all of the formal MRLs considered
in this thesis, namely the Geoquery logical query language, the Geoquery
functional query language (FunQL), and CLANG (Section 2.1). These formal
MRL grammars are used to train various semantic parsers and tactical generators,
including all WASP-based systems and the PHARAOH++ tactical generator (Section
5.3.1).

A.1 The Geoquery Logical Query Language

The  Geoquery  logical  query  language  was  devised  by  Zelle  (1995,  Sec. 
7.3)  for  querying  a  U.S.  geography  database  called  Geoquery.  Since  the  database 
was  written  in  Prolog,  the  query  language  is  basically  first-order  Prolog  logical 
forms,  augmented  with  several  meta-predicates  for  dealing  with  quantification. 

There  are  14  different  non-terminal  symbols  in  this  grammar,  of  which 
Query  is  the  start  symbol.  The  following  non-terminal  symbols  are  for  entities 
referenced in the Geoquery database:

Entity types                          Non-terminals   Sample productions
City names                            CityName        CityName → austin
Country names                         CountryName     CountryName → usa
Place names (lakes, mountains, etc.)  PlaceName       PlaceName → tahoe
River names                           RiverName       RiverName → mississippi
State abbreviations                   StateAbbrev     StateAbbrev → tx
State names                           StateName       StateName → texas
Numbers                               Num             Num → 0

The  following  non-terminals  are  used  to  disambiguate  between  entities  that  share 
the  same  name  (e.g.  the  state  of  Mississippi  and  the  Mississippi  river).  Note  the 
corresponding  Prolog  functors  (e.g.  stateid  and  riverid): 


Entity types   Non-terminals   Productions
Cities         City            City → cityid(CityName, StateAbbrev)
                               City → cityid(CityName, _)
Countries      Country         Country → countryid(CountryName)
Places         Place           Place → placeid(PlaceName)
Rivers         River           River → riverid(RiverName)
States         State           State → stateid(StateName)

The Form non-terminal (short for “formula”) is for the following first-order predicates,
which provide most of the expressiveness of the Geoquery language. Note
that x1, x2, ... are logical variables that denote entities:


Productions                 Meaning of predicates
Form → capital(x1)          x1 is a capital (city).
Form → city(x1)             x1 is a city.
Form → country(x1)          x1 is a country.
Form → lake(x1)             x1 is a lake.
Form → major(x1)            x1 is major (as in a major city or a major river).
Form → mountain(x1)         x1 is a mountain.
Form → place(x1)            x1 is a place.
Form → river(x1)            x1 is a river.
Form → state(x1)            x1 is a state.
Form → area(x1, x2)         The area of x1 is x2.
Form → capital(x1, x2)      The capital of x1 is x2.
Form → density(x1, x2)      The population density of x1 is x2.
Form → elevation(x1, x2)    The elevation of x1 is x2.
Form → elevation(x1, Num)   The elevation of x1 is Num.
Form → high_point(x1, x2)   The highest point of x1 is x2.
Form → higher(x1, x2)       The elevation of x1 is greater than that of x2.
Form → len(x1, x2)          The length of x1 is x2.
Form → loc(x1, x2)          x1 is located in x2.
Form → longer(x1, x2)       The length of x1 is greater than that of x2.
Form → low_point(x1, x2)    The lowest point of x1 is x2.
Form → lower(x1, x2)        The elevation of x1 is less than that of x2.
Form → next_to(x1, x2)      x1 is adjacent to x2.
Form → population(x1, x2)   The population of x1 is x2.
Form → size(x1, x2)         The size of x1 is x2.
Form → traverse(x1, x2)     x1 traverses x2.


The following m-tuples are used to constrain the combinations of entity types that
the arguments of an m-place predicate can denote. See Section 4.2.5 for how to use
these m-tuples for type checking:


Predicates            Possible entity types for logical variables
capital(x1)           (city), (place)
city(x1)              (city)
country(x1)           (country)
lake(x1)              (place), (lake)
major(x1)             (city), (lake), (river)
mountain(x1)          (place), (mountain)
place(x1)             (place), (lake), (mountain)
river(x1)             (river)
state(x1)             (state)
area(x1, x2)          (city, num), (country, num), (state, num)
capital(x1, x2)       (state, city)
density(x1, x2)       (city, num), (country, num), (state, num)
elevation(x1, x2)     (place, num), (mountain, num)
elevation(x1, Num)    (place), (mountain)
high_point(x1, x2)    (country, place), (country, mountain),
                      (state, place), (state, mountain)
higher(x1, x2)        (place, place), (place, mountain),
                      (mountain, place), (mountain, mountain)
len(x1, x2)           (river, num)
loc(x1, x2)           (city, country), (place, country),
                      (lake, country), (mountain, country),
                      (river, country), (state, country),
                      (city, state), (place, state), (lake, state),
                      (mountain, state), (river, state), (place, city)
longer(x1, x2)        (river, river)
low_point(x1, x2)     (country, place), (country, mountain),
                      (state, place), (state, mountain)
lower(x1, x2)         (place, place), (place, mountain),
                      (mountain, place), (mountain, mountain)
next_to(x1, x2)       (state, river), (state, state)
population(x1, x2)    (city, num), (country, num), (state, num)
size(x1, x2)          (city, num), (country, num), (place, num),
                      (lake, num), (mountain, num), (river, num),
                      (state, num)
traverse(x1, x2)      (river, city), (river, country), (river, state)
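
One way such m-tuples could be used for type checking is sketched below. This is only an illustration, not the procedure of Section 4.2.5: the ALLOWED table copies a handful of rows from the table above, the function name consistent_typings is hypothetical, and entity types are compared by strict equality, which ignores any relationship between types such as place and lake.

```python
# Allowed entity-type tuples for a few predicates, copied from the table above
# and indexed by (predicate name, number of arguments).
ALLOWED = {
    ("city", 1): {("city",)},
    ("river", 1): {("river",)},
    ("state", 1): {("state",)},
    ("capital", 2): {("state", "city")},
    ("len", 2): {("river", "num")},
    ("next_to", 2): {("state", "river"), ("state", "state")},
    ("traverse", 2): {("river", "city"), ("river", "country"), ("river", "state")},
}


def consistent_typings(atoms):
    """Return every assignment of entity types to variables that fits all atoms.

    atoms is a list of (predicate, variables) pairs, e.g. ("len", ("x1", "x2")).
    An empty result means the conjunction is ill-typed.
    """
    typings = [{}]
    for pred, variables in atoms:
        extended = []
        for typing in typings:
            for type_tuple in sorted(ALLOWED[(pred, len(variables))]):
                candidate = dict(typing)
                # A variable keeps the first type it was given; any clash rejects
                # this candidate assignment.
                if all(candidate.setdefault(v, t) == t
                       for v, t in zip(variables, type_tuple)):
                    extended.append(candidate)
        typings = extended
    return typings


well_typed = [("river", ("x1",)), ("traverse", ("x1", "x2")), ("state", ("x2",))]
ill_typed = [("state", ("x1",)), ("len", ("x1", "x2"))]
print(consistent_typings(well_typed))  # [{'x1': 'river', 'x2': 'state'}]
print(consistent_typings(ill_typed))   # []
```

The second conjunction is rejected because len only admits (river, num), while state(x1) has already forced x1 to be a state.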

In addition, the equal predicate is used to equate logical variables to ground terms,
e.g. equal(x1, cityid(austin, tx)):


Productions                 Possible entity types for logical variables
Form → equal(x1, City)      (city)
Form → equal(x1, Country)   (country)
Form → equal(x1, Place)     (place), (lake), (mountain)
Form → equal(x1, River)     (river)
Form → equal(x1, State)     (state)

Another  important  production  is  the  conjunction  operator  (, ),  which  is  used  to  form 
conjunctions  of  formulas: 

Form → (Form, Form)

The not operator is used to form negations:

Form → not(Form)


The Form non-terminal is also for the following meta-predicates, which take conjunctive
goals as their arguments:


Productions                   Meaning of meta-predicates
Form → largest(x1, Form)      The goal denoted by Form produces only
                              the solution maximizing the size of x1.
Form → smallest(x1, Form)     The goal denoted by Form produces only
                              the solution minimizing the size of x1.
Form → highest(x1, Form)      Analogous to largest (with elevation).
Form → lowest(x1, Form)       Analogous to smallest (with elevation).
Form → longest(x1, Form)      Analogous to largest (with length).
Form → shortest(x1, Form)     Analogous to smallest (with length).
Form → count(x1, Form, x2)    x2 is the number of bindings for x1 satisfying
                              the goal denoted by Form.
Form → sum(x1, Form, x2)      x2 is the sum of all bindings for x1 satisfying
                              the goal denoted by Form.
Form → most(x1, x2, Form)     The goal denoted by Form produces only
                              the x1 maximizing the count of x2.
Form → fewest(x1, x2, Form)   The goal denoted by Form produces only
                              the x1 minimizing the count of x2.
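
The counting meta-predicates can be read operationally, as in the following sketch. It is an assumption-laden illustration rather than the Geoquery implementation: the LOC facts are a tiny toy set, the Python functions count, most and fewest are hypothetical stand-ins that receive the bindings of a goal directly instead of evaluating a Prolog goal, and ties are broken arbitrarily.

```python
from collections import Counter

# Toy facts for illustration only: loc(city, state) pairs.
LOC = {
    ("austin", "texas"),
    ("dallas", "texas"),
    ("houston", "texas"),
    ("boston", "massachusetts"),
}


def count(bindings):
    """count(x1, Form, x2): x2 is the number of distinct bindings for x1."""
    return len(set(bindings))


def most(pairs):
    """most(x1, x2, Form): the x1 with the largest number of distinct x2."""
    tally = Counter(x1 for x1, _ in set(pairs))
    return max(tally, key=tally.get)


def fewest(pairs):
    """fewest(x1, x2, Form): the x1 with the smallest number of distinct x2."""
    tally = Counter(x1 for x1, _ in set(pairs))
    return min(tally, key=tally.get)


# "How many cities are in Texas?"
print(count(city for city, state in LOC if state == "texas"))  # 3
# "Which state has the most cities?" (x1 = state, x2 = city)
print(most((state, city) for city, state in LOC))    # texas
print(fewest((state, city) for city, state in LOC))  # massachusetts
```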


Below are the corresponding m-tuples of entity types for type checking:


Meta-predicates         Possible entity types for logical variables
largest(x1, Form)       (city), (place), (lake), (mountain), (num),
                        (river), (state)
smallest(x1, Form)      (city), (place), (lake), (mountain), (num),
                        (river), (state)
highest(x1, Form)       (place), (mountain)
lowest(x1, Form)        (place), (mountain)
longest(x1, Form)       (river)
shortest(x1, Form)      (river)
count(x1, Form, x2)     (*, num)
sum(x1, Form, x2)       (num, num)
most(x1, x2, Form)      (*, *)
fewest(x1, x2, Form)    (*, *)

In  the  above  table,  *  denotes  any  of  these  entity  types:  CITY,  COUNTRY,  PLACE, 
LAKE,  MOUNTAIN,  NUM,  RIVER,  STATE. 


Finally, the start symbol, Query, is reserved for the answer meta-predicate,
which serves as a wrapper for query goals (denoted by Form):

Query → answer(x1, Form)

Here x1 is the logical variable whose binding is of interest (i.e. answers the question
posed). x1 can denote entities of any type (*).

A.2  The  Geoquery  Functional  Query  Language 

For semantic parsers and tactical generators that cannot handle logical variables
(e.g. WASP, PHARAOH++, WASP⁻¹++), a variable-free, functional query language
called FunQL has been devised for the Geoquery domain (Kate et al.,
2005). Below is a sample FunQL query, together with its corresponding Prolog
logical form:

What  are  the  cities  in  Texas? 

FunQL: answer(city(loc_2(stateid(texas))))

Prolog logical form: answer(x1, (city(x1), loc(x1, x2),
equal(x2, stateid(texas))))

In Section 2.1, we noted that FunQL predicates can have a set-theoretic interpretation.
For example, the term stateid(texas) denotes a singleton set that
consists of the Texas state, and loc_2(stateid(texas)) denotes the set of
entities located in the Texas state, and so on. Here we present another interpretation
of FunQL based on the lambda calculus. Under this interpretation, each
FunQL predicate is a shorthand for a λ-function, which can be used to translate
FunQL expressions into the Geoquery logical query language through function
application. For example, the FunQL predicate stateid denotes the λ-function
λn.λx1.equal(x1, stateid(n)). Hence by function application, the FunQL
term stateid(texas) is equivalent to the following logical form in the Geoquery
logical query language:




λx1.equal(x1, stateid(texas))


Also since the FunQL predicate loc_2 denotes λp.λx1.(loc(x1, x2), p(x2)),
the FunQL term loc_2(stateid(texas)) is equivalent to:

λx1.(loc(x1, x2), equal(x2, stateid(texas)))
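
The translation by function application can be mimicked with closures, as in the sketch below. It is a minimal illustration under stated assumptions: each Python function stands for the λ-function of the FunQL predicate of the same name, the fresh-variable counter is a device of this sketch for naming x2, x3 and so on, and nested conjunctions are flattened into one list so that the output matches the sample logical form given earlier in this section.

```python
from itertools import count


def stateid(name):
    # λn.λx1.equal(x1, stateid(n))
    return lambda x, fresh: [f"equal({x}, stateid({name}))"]


def loc_2(p):
    # λp.λx1.(loc(x1, x2), p(x2)), where x2 is a fresh variable
    def body(x, fresh):
        y = fresh()
        return [f"loc({x}, {y})"] + p(y, fresh)
    return body


def city(p):
    # λp.λx1.(city(x1), p(x1))
    return lambda x, fresh: [f"city({x})"] + p(x, fresh)


def answer(p):
    # λp.answer(x1, p(x1))
    counter = count(1)

    def fresh():
        return f"x{next(counter)}"

    x = fresh()
    return f"answer({x}, ({', '.join(p(x, fresh))}))"


print(answer(city(loc_2(stateid("texas")))))
# answer(x1, (city(x1), loc(x1, x2), equal(x2, stateid(texas))))
```

Each closure takes the variable it constrains and returns its conjuncts, so applying the outermost predicate drives the whole translation top-down.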

There  are  13  different  non-terminal  symbols  in  the  FunQL  grammar.  All  of  them 
are  from  the  Geoquery  logical  query  language.  Only  the  Form  non-terminal  is 
not  used  in  FunQL.  Query  is  the  start  symbol  in  the  FunQL  grammar. 

Below  are  the  FunQL  productions  for  named  entities  and  numbers,  which 
are  identical  to  those  in  the  Geoquery  logical  query  language: 


Entity types          Sample productions          Corresponding λ-functions
City names            CityName → austin           austin
Country names         CountryName → usa           usa
Place names           PlaceName → tahoe           tahoe
River names           RiverName → mississippi     mississippi
State abbreviations   StateAbbrev → tx            tx
State names           StateName → texas           texas
Numbers               Num → 0                     0

The  rest  of  the  FunQL  productions  are  as  follows: 


Productions                                 Corresponding λ-functions
City → cityid(CityName, StateAbbrev)        λn.λa.λx1.equal(x1, cityid(n, a))
City → cityid(CityName, _)                  λn.λx1.equal(x1, cityid(n, _))
Country → countryid(CountryName)            λn.λx1.equal(x1, countryid(n))
Place → placeid(PlaceName)                  λn.λx1.equal(x1, placeid(n))
River → riverid(RiverName)                  λn.λx1.equal(x1, riverid(n))
State → stateid(StateName)                  λn.λx1.equal(x1, stateid(n))
City → capital(all)                         λx1.capital(x1)
City → city(all)                            λx1.city(x1)
Country → country(all)                      λx1.country(x1)
Place → lake(all)                           λx1.lake(x1)
Place → mountain(all)                       λx1.mountain(x1)
Place → place(all)                          λx1.place(x1)
River → river(all)                          λx1.river(x1)
State → state(all)                          λx1.state(x1)
City → capital(City)                        λp.λx1.(capital(x1), p(x1))
City → capital(Place)                       λp.λx1.(capital(x1), p(x1))
City → city(City)                           λp.λx1.(city(x1), p(x1))
Place → lake(Place)                         λp.λx1.(lake(x1), p(x1))
City → major(City)                          λp.λx1.(major(x1), p(x1))
Place → major(Place)                        λp.λx1.(major(x1), p(x1))
River → major(River)                        λp.λx1.(major(x1), p(x1))
Place → mountain(Place)                     λp.λx1.(mountain(x1), p(x1))
Place → place(Place)                        λp.λx1.(place(x1), p(x1))
River → river(River)                        λp.λx1.(river(x1), p(x1))
State → state(State)                        λp.λx1.(state(x1), p(x1))
Num → area_1(City)                          λp.λx1.(area(x2, x1), p(x2))
Num → area_1(Country)                       λp.λx1.(area(x2, x1), p(x2))
Num → area_1(Place)                         λp.λx1.(area(x2, x1), p(x2))
Num → area_1(State)                         λp.λx1.(area(x2, x1), p(x2))
City → capital_1(Country)                   λp.λx1.(capital(x2, x1), p(x2))
City → capital_1(State)                     λp.λx1.(capital(x2, x1), p(x2))
State → capital_2(City)                     λp.λx1.(capital(x1, x2), p(x2))
Num → density_1(City)                       λp.λx1.(density(x2, x1), p(x2))
Num → density_1(Country)                    λp.λx1.(density(x2, x1), p(x2))
Num → density_1(State)                      λp.λx1.(density(x2, x1), p(x2))
Num → elevation_1(Place)                    λp.λx1.(elevation(x2, x1), p(x2))
Place → elevation_2(Num)                    λn.λx1.elevation(x1, n)
Place → high_point_1(State)                 λp.λx1.(high_point(x2, x1), p(x2))
State → high_point_2(Place)                 λp.λx1.(high_point(x1, x2), p(x2))
Place → higher_2(Place)                     λp.λx1.(higher(x1, x2), p(x2))
Num → len(River)                            λp.λx1.(len(x2, x1), p(x2))
City → loc_1(Place)                         λp.λx1.(loc(x2, x1), p(x2))
Country → loc_1(City)                       λp.λx1.(loc(x2, x1), p(x2))
Country → loc_1(Place)                      λp.λx1.(loc(x2, x1), p(x2))
Country → loc_1(River)                      λp.λx1.(loc(x2, x1), p(x2))
Country → loc_1(State)                      λp.λx1.(loc(x2, x1), p(x2))
State → loc_1(City)                         λp.λx1.(loc(x2, x1), p(x2))
State → loc_1(Place)                        λp.λx1.(loc(x2, x1), p(x2))
State → loc_1(River)                        λp.λx1.(loc(x2, x1), p(x2))
City → loc_2(Country)                       λp.λx1.(loc(x1, x2), p(x2))
City → loc_2(State)                         λp.λx1.(loc(x1, x2), p(x2))
Place → loc_2(City)                         λp.λx1.(loc(x1, x2), p(x2))
Place → loc_2(State)                        λp.λx1.(loc(x1, x2), p(x2))
Place → loc_2(Country)                      λp.λx1.(loc(x1, x2), p(x2))
River → loc_2(Country)                      λp.λx1.(loc(x1, x2), p(x2))
River → loc_2(State)                        λp.λx1.(loc(x1, x2), p(x2))
State → loc_2(Country)                      λp.λx1.(loc(x1, x2), p(x2))
River → longer(River)                       λp.λx1.(longer(x1, x2), p(x2))
Place → lower_2(Place)                      λp.λx1.(lower(x1, x2), p(x2))
State → next_to_1(State)                    λp.λx1.(next_to(x2, x1), p(x2))
State → next_to_2(State)                    λp.λx1.(next_to(x1, x2), p(x2))
State → next_to_2(River)                    λp.λx1.(next_to(x1, x2), p(x2))
Num → population_1(City)                    λp.λx1.(population(x2, x1), p(x2))
Num → population_1(Country)                 λp.λx1.(population(x2, x1), p(x2))
Num → population_1(State)                   λp.λx1.(population(x2, x1), p(x2))
Num → size(City)                            λp.λx1.(size(x2, x1), p(x2))
Num → size(Country)                         λp.λx1.(size(x2, x1), p(x2))
Num → size(State)                           λp.λx1.(size(x2, x1), p(x2))
City → traverse_1(River)                    λp.λx1.(traverse(x2, x1), p(x2))
Country → traverse_1(River)                 λp.λx1.(traverse(x2, x1), p(x2))
State → traverse_1(River)                   λp.λx1.(traverse(x2, x1), p(x2))
River → traverse_2(City)                    λp.λx1.(traverse(x1, x2), p(x2))
River → traverse_2(Country)                 λp.λx1.(traverse(x1, x2), p(x2))
River → traverse_2(State)                   λp.λx1.(traverse(x1, x2), p(x2))
City → largest(City)                        λp.λx1.largest(x1, p(x1))
Place → largest(Place)                      λp.λx1.largest(x1, p(x1))
State → largest(State)                      λp.λx1.largest(x1, p(x1))
State → largest_one(area_1(State))          λp.λx1.largest(x2, (area(x1, x2), p(x1)))
City → largest_one(density_1(City))         λp.λx1.largest(x2, (density(x1, x2), p(x1)))
State → largest_one(density_1(State))       λp.λx1.largest(x2, (density(x1, x2), p(x1)))
City → largest_one(population_1(City))      λp.λx1.largest(x2, (population(x1, x2), p(x1)))
State → largest_one(population_1(State))    λp.λx1.largest(x2, (population(x1, x2), p(x1)))
City → smallest(City)                       λp.λx1.smallest(x1, p(x1))
Num → smallest(Num)                         λp.λx1.smallest(x1, p(x1))
Place → smallest(Place)                     λp.λx1.smallest(x1, p(x1))
State → smallest(State)                     λp.λx1.smallest(x1, p(x1))
State → smallest_one(area_1(State))         λp.λx1.smallest(x2, (area(x1, x2), p(x1)))
State → smallest_one(density_1(State))      λp.λx1.smallest(x2, (density(x1, x2), p(x1)))
City → smallest_one(population_1(City))     λp.λx1.smallest(x2, (population(x1, x2), p(x1)))
State → smallest_one(population_1(State))   λp.λx1.smallest(x2, (population(x1, x2), p(x1)))
Place → highest(Place)                      λp.λx1.highest(x1, p(x1))
Place → lowest(Place)                       λp.λx1.lowest(x1, p(x1))
River → longest(River)                      λp.λx1.longest(x1, p(x1))
River → shortest(River)                     λp.λx1.shortest(x1, p(x1))
Num → count(City)                           λp.λx1.count(x2, p(x2), x1)
Num → count(Place)                          λp.λx1.count(x2, p(x2), x1)
Num → count(River)                          λp.λx1.count(x2, p(x2), x1)
Num → count(State)                          λp.λx1.count(x2, p(x2), x1)
Num → sum(Num)                              λp.λx1.sum(x2, p(x2), x1)
City → most(City)                           λp'.λx1.most(x1, x', p'(x1)), where p' contains
                                            one and only one free variable, x'
Place → most(Place)                         λp'.λx1.most(x1, x', p'(x1))
River → most(River)                         λp'.λx1.most(x1, x', p'(x1))
State → most(State)                         λp'.λx1.most(x1, x', p'(x1))
City → fewest(City)                         λp'.λx1.fewest(x1, x', p'(x1))
Place → fewest(Place)                       λp'.λx1.fewest(x1, x', p'(x1))
River → fewest(River)                       λp'.λx1.fewest(x1, x', p'(x1))
State → fewest(State)                       λp'.λx1.fewest(x1, x', p'(x1))
City → intersection(City, City)             λp1.λp2.λx1.(p1(x1), p2(x1))
Place → intersection(Place, Place)          λp1.λp2.λx1.(p1(x1), p2(x1))
River → intersection(River, River)          λp1.λp2.λx1.(p1(x1), p2(x1))
State → intersection(State, State)          λp1.λp2.λx1.(p1(x1), p2(x1))
City → exclude(City, City)                  λp1.λp2.λx1.(p1(x1), not(p2(x1)))
Place → exclude(Place, Place)               λp1.λp2.λx1.(p1(x1), not(p2(x1)))
River → exclude(River, River)               λp1.λp2.λx1.(p1(x1), not(p2(x1)))
State → exclude(State, State)               λp1.λp2.λx1.(p1(x1), not(p2(x1)))
Query → answer(City)                        λp.answer(x1, p(x1))
Query → answer(Country)                     λp.answer(x1, p(x1))
Query → answer(Num)                         λp.answer(x1, p(x1))
Query → answer(Place)                       λp.answer(x1, p(x1))
Query → answer(River)                       λp.answer(x1, p(x1))
Query → answer(State)                       λp.answer(x1, p(x1))

A.3  CLANG:  The  RoboCup  Coach  Language 

In the RoboCup Coach Competition, teams compete to provide effective
instructions to advice-taking agents in the simulated soccer domain. Coaching
instructions are provided in a formal coach language called CLANG (Chen et al.,
2003, Sec. 7.7).

The CLANG grammar described here largely follows the one described
in Chen et al. (2003). We have slightly modified CLANG to introduce a few
concepts that are not easily expressed in the original CLANG language. These new
constructs are marked with asterisks (*).

In  CLANG,  coaching  instructions  come  in  the  form  of  if-then  rules.  Each 
if-then  rule  consists  of  a  condition  and  a  directive: 

RULE → (CONDITION DIRECTIVE)
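Since a rule is a parenthesized pair, it can be read with a small S-expression reader. The sketch below is one possible way to split a rule into its condition and directive; the nested-list representation, the `"set"` tag for brace lists, and the function names are assumptions of this example, not part of CLANG.

```python
import re

def tokenize(text):
    # Parentheses and braces are single tokens; quoted identifiers stay whole.
    return re.findall(r'[(){}]|"[^"]*"|[^\s(){}]+', text)

def read(tokens, pos=0):
    """Read one expression starting at tokens[pos]; return (tree, next_pos)."""
    tok = tokens[pos]
    if tok in ("(", "{"):
        close = ")" if tok == "(" else "}"
        items = ["set"] if tok == "{" else []  # tag brace lists (our convention)
        pos += 1
        while tokens[pos] != close:
            item, pos = read(tokens, pos)
            items.append(item)
        return items, pos + 1
    return tok, pos + 1

def parse_rule(text):
    tokens = tokenize(text)
    tree, end = read(tokens)
    if end != len(tokens) or not isinstance(tree, list) or len(tree) != 2:
        raise ValueError("expected a single (CONDITION DIRECTIVE) pair")
    condition, directive = tree
    return condition, directive

condition, directive = parse_rule(
    "((bpos (penalty-area our)) (do (player our {4}) (intercept)))")
print(condition)   # ['bpos', ['penalty-area', 'our']]
print(directive)   # ['do', ['player', 'our', ['set', '4']], ['intercept']]
```

The reader does not validate the expression against the grammar, and unbalanced input raises an `IndexError`.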


Possible  conditions  are: 


Productions                                    Meaning of predicates

CONDITION → (true)                             Always true.
CONDITION → (false)                            Always false.
CONDITION → (ppos PLAYER UNUM₁ UNUM₂ REGION)   At least UNUM₁ and at most UNUM₂ of
                                               PLAYER is in REGION.
CONDITION → (ppos-any PLAYER REGION)           Some of PLAYER is in REGION.
CONDITION → (ppos-none our REGION) *           None of our players is in REGION.
CONDITION → (ppos-none opp REGION) *           None of the opponents is in REGION.
CONDITION → (bpos REGION)                      The ball is in REGION.
CONDITION → (bowner PLAYER)                    PLAYER owns the ball.
CONDITION → (playm bko)                        Specific play modes (Chen et al., 2003).
CONDITION → (playm time_over)
CONDITION → (playm play_on)
CONDITION → (playm ko_our)
CONDITION → (playm ko_opp)
CONDITION → (playm ki_our)
CONDITION → (playm ki_opp)
CONDITION → (playm fk_our)
CONDITION → (playm fk_opp)
CONDITION → (playm ck_our)
CONDITION → (playm ck_opp)
CONDITION → (playm gk_our)
CONDITION → (playm gk_opp)
CONDITION → (playm gc_our)
CONDITION → (playm gc_opp)
CONDITION → (playm ag_our)
CONDITION → (playm ag_opp)
CONDITION → "IDENT"                            Condition named IDENT. See definec.
CONDITION → (< NUM₁ NUM₂)                      NUM₁ is smaller than NUM₂. Both NUM₁
                                               and NUM₂ can be identifiers.
CONDITION → (> NUM₁ NUM₂)                      NUM₁ is greater than NUM₂.
CONDITION → (<= NUM₁ NUM₂)                     NUM₁ is not greater than NUM₂.
CONDITION → (== NUM₁ NUM₂)                     NUM₁ is equal to NUM₂.
CONDITION → (>= NUM₁ NUM₂)                     NUM₁ is not smaller than NUM₂.
CONDITION → (!= NUM₁ NUM₂)                     NUM₁ is not equal to NUM₂.
CONDITION → (and CONDITION₁ CONDITION₂)        CONDITION₁ and CONDITION₂.
CONDITION → (or CONDITION₁ CONDITION₂)         CONDITION₁ or CONDITION₂.
CONDITION → (not CONDITION)                    CONDITION is not true.
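The logical and numeric conditions above have a straightforward recursive reading. The following sketch evaluates them over a nested-list representation; the representation, the function name `holds`, and the `named` dictionary for conditions introduced by definec are assumptions of this example. It handles numeric literals only (not numeric identifiers) and omits the conditions that depend on the game state (ppos, bpos, bowner, playm).

```python
import operator

# CLANG comparison operators mapped onto Python's.
COMPARATORS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
               "==": operator.eq, ">=": operator.ge, "!=": operator.ne}

def holds(cond, named=None):
    """Evaluate a state-independent condition given as nested lists."""
    named = named or {}
    if isinstance(cond, str):              # "IDENT": look up a named condition
        return holds(named[cond], named)
    head, *args = cond
    if head == "true":
        return True
    if head == "false":
        return False
    if head == "and":
        return holds(args[0], named) and holds(args[1], named)
    if head == "or":
        return holds(args[0], named) or holds(args[1], named)
    if head == "not":
        return not holds(args[0], named)
    if head in COMPARATORS:
        return COMPARATORS[head](float(args[0]), float(args[1]))
    raise NotImplementedError(f"condition {head!r} needs the game state")

print(holds(["and", ["<", "3", "5"], ["not", ["false"]]]))   # True
print(holds(["or", ["false"], ["!=", "2", "2"]]))            # False
```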

Directives  are  lists  of  actions  for  individual  players  to  take: 


Productions                          Meaning of predicates

DIRECTIVE → (do PLAYER ACTION)       PLAYER should take ACTION.
DIRECTIVE → (dont PLAYER ACTION)     PLAYER should avoid taking ACTION.


Possible  actions  are: 


Productions                   Meaning of predicates

ACTION → (pos REGION)         Go to REGION.
ACTION → (home REGION)        Set default position to REGION.
ACTION → (mark PLAYER)        Mark PLAYER (usually opponents).
ACTION → (markl REGION)       Mark the passing lane from current ball position
                              to REGION.
ACTION → (markl PLAYER)       Mark the passing lane from current ball position
                              to position of PLAYER (usually opponents).
ACTION → (oline REGION)       Set offside-trap line to REGION.
ACTION → (pass REGION)        Pass the ball to REGION.
ACTION → (pass PLAYER)        Pass the ball to PLAYER.
ACTION → (dribble REGION)     Dribble the ball to REGION.
ACTION → (clear REGION)       Clear the ball to REGION.
ACTION → (shoot)              Shoot the ball.
ACTION → (hold)               Hold the ball.
ACTION → (intercept)          Intercept the ball.
ACTION → (tackle PLAYER)      Tackle PLAYER.

The following productions are for specifying players (UNUM stands for "uniform
number", i.e. 1 to 11):

Productions                                            Meaning of predicates

PLAYER → (player our {UNUM}) **                        Our player UNUM.
PLAYER → (player our {UNUM₁ UNUM₂}) **                 Our players UNUM₁ and UNUM₂.
PLAYER → (player our {UNUM₁ UNUM₂ UNUM₃}) **           Our players UNUM₁, UNUM₂ and UNUM₃.
PLAYER → (player our {UNUM₁ UNUM₂ UNUM₃ UNUM₄}) **     Our players UNUM₁, UNUM₂, UNUM₃
                                                       and UNUM₄.
PLAYER → (player opp {UNUM}) **                        Opponent player UNUM.
PLAYER → (player our {0}) **                           Our team.
PLAYER → (player opp {0}) **                           Opponent's team.
PLAYER → (player-range our UNUM₁ UNUM₂) *              Our players UNUM₁ to UNUM₂.
PLAYER → (player-range opp UNUM₁ UNUM₂) *              Opponent players UNUM₁ to UNUM₂.
PLAYER → (player-except our {UNUM}) *                  Our team except player UNUM.
PLAYER → (player-except opp {UNUM}) *                  Opponent's team except player UNUM.
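To make the player productions concrete, the sketch below expands a PLAYER expression into a team and a set of uniform numbers. The tuple representation, the function name `unums`, and the reading of player-range as inclusive of both endpoints are assumptions of this example.

```python
ALL_UNUMS = set(range(1, 12))  # uniform numbers 1 to 11

def unums(player):
    """Return (team, set of uniform numbers) for a PLAYER expression."""
    head, team, *rest = player
    if head == "player":
        numbers = set(rest[0])            # rest[0] is the brace list
        # {0} denotes the whole team.
        return team, (set(ALL_UNUMS) if numbers == {0} else numbers)
    if head == "player-range":
        low, high = rest
        return team, set(range(low, high + 1))   # assumed inclusive
    if head == "player-except":
        return team, ALL_UNUMS - set(rest[0])
    raise ValueError(f"unknown player expression: {head!r}")

print(unums(("player", "our", [2, 5])))       # ('our', {2, 5})
print(unums(("player-range", "opp", 2, 4)))   # ('opp', {2, 3, 4})
```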


Productions marked with double asterisks (**) are slight variations of existing
constructs in the original CLANG grammar (e.g. as in (bowner our {4})). The new
player predicate is introduced for uniformity. To specify regions, we can use the
following productions:


Productions                                  Meaning of predicates

REGION → POINT                               A POINT.
REGION → (rec POINT₁ POINT₂)                 A rectangle with opposite corners POINT₁ and
                                             POINT₂.
REGION → (tri POINT₁ POINT₂ POINT₃)          A triangle with corners POINT₁, POINT₂ and
                                             POINT₃.
REGION → (arc POINT NUM₁ NUM₂ NUM₃ NUM₄)     A donut arc (Chen et al., 2003).
REGION → (circle POINT NUM) *                A circle of center POINT and radius NUM.
REGION → (null)                              The empty region.
REGION → (reg REGION₁ REGION₂)               The union of REGION₁ and REGION₂.
REGION → (reg-exclude REGION₁ REGION₂) *     REGION₁ excluding REGION₂.
REGION → (field) *                           The field.
REGION → (half TEAM) *                       The TEAM's half of field. TEAM can be
                                             either our or opp.
REGION → (penalty-area TEAM) *               The TEAM's penalty area.
REGION → (goal-area TEAM) *                  The TEAM's goal area.
REGION → (midfield) *                        The midfield.
REGION → (midfield TEAM) *                   The TEAM's midfield.
REGION → (near-goal-line TEAM) *             Near TEAM's goal line.
REGION → (from-goal-line TEAM NUM₁ NUM₂) *   NUM₁ to NUM₂ meters from TEAM's goal
                                             line.
REGION → (left REGION) *                     The left half of REGION (from our team's
                                             perspective).
REGION → (right REGION) *                    The right half of REGION.
REGION → (left-quarter REGION) *             The left quarter of REGION.
REGION → (right-quarter REGION) *            The right quarter of REGION.
REGION → "IDENT"                             Region named IDENT. See definer.
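The purely geometric region constructs can be read as point-membership tests. The following sketch covers rec, circle, null, reg and reg-exclude only; the tuple representation, the function name `contains`, and the treatment of boundaries as inside the region are assumptions of this example. Regions that depend on field dimensions or the team's side are left out.

```python
import math

def contains(region, point):
    """Return True if point (x, y) lies in the given region."""
    x, y = point
    head, *args = region
    if head == "rec":                      # rectangle with opposite corners
        (x1, y1), (x2, y2) = args
        return (min(x1, x2) <= x <= max(x1, x2)
                and min(y1, y2) <= y <= max(y1, y2))
    if head == "circle":                   # center and radius
        (cx, cy), radius = args
        return math.hypot(x - cx, y - cy) <= radius
    if head == "null":                     # the empty region
        return False
    if head == "reg":                      # union of two regions
        return contains(args[0], point) or contains(args[1], point)
    if head == "reg-exclude":              # first region minus the second
        return contains(args[0], point) and not contains(args[1], point)
    raise NotImplementedError(f"region {head!r} is not covered by this sketch")

square_with_hole = ("reg-exclude",
                    ("rec", (0, 0), (10, 10)),
                    ("circle", (5, 5), 2))
print(contains(square_with_hole, (1, 1)))   # True
print(contains(square_with_hole, (5, 5)))   # False
```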


To  specify  points,  we  can  use  the  following  productions: 


Productions                                         Meaning of predicates

POINT → (pt NUM₁ NUM₂)                              The xy-coordinates (NUM₁, NUM₂).
POINT → (pt ball)                                   The current ball position.
POINT → POINT₁ + POINT₂                             Coordinate-wise addition.
POINT → POINT₁ - POINT₂                             Coordinate-wise subtraction.
POINT → POINT₁ * POINT₂                             Coordinate-wise multiplication.
POINT → POINT₁ / POINT₂                             Coordinate-wise division.
POINT → (pt-with-ball-attraction POINT₁ POINT₂) *   POINT₁ + ((pt ball) * POINT₂).
POINT → (front-of-goal TEAM) *                      Directly in front of TEAM's goal.
POINT → (from-goal TEAM NUM) *                      NUM meters in front of TEAM's goal.
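The coordinate-wise point arithmetic, and the definition of pt-with-ball-attraction in terms of it, can be sketched as follows. The `Point` class and the function name are assumptions of this example; the ball position is passed in explicitly because it comes from the game state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

    # All four operators act coordinate-wise, as in the table above.
    def __add__(self, other):
        return Point(self.x + other.x, self.y + other.y)

    def __sub__(self, other):
        return Point(self.x - other.x, self.y - other.y)

    def __mul__(self, other):
        return Point(self.x * other.x, self.y * other.y)

    def __truediv__(self, other):
        return Point(self.x / other.x, self.y / other.y)

def pt_with_ball_attraction(p1, p2, ball):
    # POINT1 + ((pt ball) * POINT2)
    return p1 + ball * p2

ball = Point(4.0, -2.0)
print(pt_with_ball_attraction(Point(10.0, 0.0), Point(0.5, 0.5), ball))
# Point(x=12.0, y=-1.0)
```

With POINT₂ = (0.5, 0.5), the resulting position shifts from POINT₁ by half of the ball's coordinates.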

The following CLANG statements can be used to define names for conditions and
regions. These names (IDENT) can be used to simplify the definition of if-then
rules:


STATEMENT → (definec "IDENT" CONDITION)
STATEMENT → (definer "IDENT" REGION)

Note  that  an  if-then  rule  is  also  a  CLANG  statement: 

STATEMENT → RULE

STATEMENT is the start symbol in the CLANG grammar.
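As a rough illustration of the three statement forms, the sketch below sorts a list of already-parsed statements into named conditions, named regions and if-then rules. The nested-list representation and the function name `load_statements` are assumptions of this example.

```python
def load_statements(statements):
    """Split parsed statements into named conditions, named regions and rules."""
    conditions, regions, rules = {}, {}, []
    for statement in statements:
        head = statement[0]
        if head == "definec":              # (definec "IDENT" CONDITION)
            _, name, body = statement
            conditions[name] = body
        elif head == "definer":            # (definer "IDENT" REGION)
            _, name, body = statement
            regions[name] = body
        else:                              # a rule: (CONDITION DIRECTIVE)
            condition, directive = statement
            rules.append((condition, directive))
    return conditions, regions, rules

program = [
    ["definer", "box", ["penalty-area", "our"]],
    ["definec", "ball-in-box", ["bpos", "box"]],
    ["ball-in-box", ["do", ["player", "our", [4]], ["intercept"]]],
]
conditions, regions, rules = load_statements(program)
print(conditions["ball-in-box"])   # ['bpos', 'box']
print(len(rules))                  # 1
```

A rule whose condition is a bare name would be misread if that name were literally definec or definer; a fuller implementation would keep the quotation marks around identifiers to avoid this.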